Python Pandas Mixed Boolean Yes/True and NaN Columns


Question

I am doing a health science course where R or Stata are recommended. I'm trying to use Python / NumPy / Pandas instead, as I wish to use it in future for financial time series analysis.

The data was in Stata format, so I copied the fields and saved them as a CSV. All fields import fine, except that there are a number of Yes/No columns, some of which have blank fields.

The import command is:

fhs = pd.io.parsers.read_csv('F:\\BioStatistics\\fds\\fhs_c2.csv', header=0, index_col=0)

If there is a blank field, the dtype is object (which makes sense).

If there are no blanks, some columns convert to TRUE/FALSE, while others stay as Yes/No but the dtype is bool. Any idea why?

I want them all to be one dtype and expressed one way, for viewing and statistical analysis.

I have achieved this by adding a row at the beginning with blank cells for the Boolean columns that had no blanks, so everything becomes an object. Then I use fhs = fhs.drop([1002]) to drop that row, and the data types are still good.

I'd love to save it without this row and just be able to load the data each time with the "correct" types, but I don't know if that is possible when some of the columns have all Yes or No and some have blank cells. Is it possible?

Thanks, and sorry for the newbie question.

The imported data:

      C1    C2    C3
R1   Yes   Yes    No
R2    No    No    No
R3   Yes         Yes
R4   Yes   Yes   Yes

The first column comes into the df as Yes, No, Yes, Yes, type bool (xxxx, see the correction below).

The 2nd column comes into the df as Yes, No, NaN, Yes, type object.

The 3rd column comes into the df as FALSE, FALSE, TRUE, TRUE, type bool.

Damn. Just checked. I was wrong. If it's yes and no, then the column type is object.
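The inference behaviour described here can be reproduced with a small in-memory CSV. This is only a sketch: the data below is a hypothetical stand-in for the real file (the column names C1 to C4 and the TRUE/FALSE spellings are assumptions), and it relies on read_csv's default handling, where TRUE/FALSE text is parsed as booleans while Yes/No is left as text.

```python
import io

import pandas as pd
from pandas.api.types import is_bool_dtype

# Hypothetical stand-in for the real CSV (not the asker's data):
# C1: Yes/No only, C2: Yes/No with a blank,
# C3: TRUE/FALSE only, C4: TRUE/FALSE with a blank.
csv_text = """id,C1,C2,C3,C4
R1,Yes,Yes,FALSE,TRUE
R2,No,No,FALSE,
R3,Yes,,TRUE,FALSE
R4,Yes,Yes,TRUE,TRUE
"""

df = pd.read_csv(io.StringIO(csv_text), header=0, index_col=0)

# Only the all-TRUE/FALSE column without blanks should come back as bool.
for name in df.columns:
    print(name, df[name].dtype, "bool" if is_bool_dtype(df[name].dtype) else "not bool")
```

Under these assumptions only C3 is inferred as bool; the Yes/No columns stay as text, and a blank cell prevents a TRUE/FALSE column from being stored as bool.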

I'd like to tell it, when importing, to make them all object and stick with yes and no, because:
1. I think the 2nd column must be object (as it's mixed otherwise, I think).
2. The data set is in yes/no, and other class members will be looking at yes and no.

Here's my data: link

Here's the code:
from pandas import *
import numpy as np
import pandas as pd

def convert_bool(col):
    if str(col).title() ==  "True": #check for nan
        return "Yes"
    elif str(col).title() == "False":
        return "No"
    else:
        return col

fhs = pd.read_csv('F:\\BioStatistics\\fds\\StatExport.csv', converters={"death": lambda x:convert_bool(x)}, header=0, index_col=0)  

And the output: link

Recommended answer

You can use the converters parameter of pandas.read_csv:

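import pandas as pd

def convert_bool(col):
    # The converter is called on each cell of the column; map boolean spellings to YES/NO.
    if str(col).title() == "True":
        return "YES"
    elif str(col).title() == "False":
        return "NO"
    else:
        return col  # any other value passes through unchanged

# file_in is the path to the CSV file
pd.read_csv(file_in, converters={"C3": convert_bool})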
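As an alternative to a converter, the columns can be read as text and normalised afterwards. The sketch below is not part of the original answer: it assumes the Yes/No columns are named C1 to C3 (hypothetical names) and uses an in-memory CSV in place of the real file. Passing dtype=str for those columns turns off boolean inference, while blank cells are still read as NaN.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV (not the asker's data).
csv_text = """id,C1,C2,C3
R1,Yes,Yes,FALSE
R2,No,No,FALSE
R3,Yes,,TRUE
R4,Yes,Yes,TRUE
"""

yes_no_cols = ["C1", "C2", "C3"]  # assumed column names

# dtype=str keeps the cell text as written; blanks still become NaN.
df = pd.read_csv(
    io.StringIO(csv_text),
    header=0,
    index_col=0,
    dtype={col: str for col in yes_no_cols},
)

# Replace exact TRUE/FALSE spellings with Yes/No; other values and NaN are left alone.
spellings = {
    "TRUE": "Yes", "True": "Yes", "true": "Yes",
    "FALSE": "No", "False": "No", "false": "No",
}
df[yes_no_cols] = df[yes_no_cols].replace(spellings)

print(df)
```

With this approach every listed column ends up holding Yes, No or NaN, whether or not it contained blanks, so no dummy row is needed to force the type.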
