Pandas read_csv fills empty values with string 'nan', instead of parsing date

Question

I assign np.nan to the missing values in a column of a DataFrame, then write the DataFrame to a csv file using to_csv. If I open the resulting file in a text editor, it correctly has nothing between the commas for the missing values. But when I read that csv file back into a DataFrame using read_csv, the missing values become the string 'nan' instead of NaN. As a result, isnull() does not work. For example:

In [13]: df
Out[13]: 
   index  value date
0    975  25.35  nan
1    976  26.28  nan
2    977  26.24  nan
3    978  25.76  nan
4    979  26.08  nan

In [14]: df.date.isnull()
Out[14]: 
0    False
1    False
2    False
3    False
4    False

Am I doing anything wrong? Should I assign some value other than np.nan to the missing values so that isnull() can pick them up?

Sorry, I forgot to mention that I also set parse_dates = [2] to parse that column. The column contains dates, with some rows missing. I would like the missing rows to be NaN.

EDIT: I just found out that the issue really is due to parse_dates. If the date column contains missing values, read_csv will not parse that column. Instead, it reads the dates as strings and assigns the string 'nan' to the empty values.

In [21]: data = pd.read_csv('test.csv', parse_dates = [1])

In [22]: data
Out[22]: 
   value      date id
0      2  2013-3-1  a
1      3  2013-3-1  b
2      4  2013-3-1  c
3      5       nan  d
4      6  2013-3-1  d

In [23]: data.date[3]
Out[23]: 'nan'

pd.to_datetime does not work either:

In [12]: data
Out[12]: 
   value      date id
0      2  2013-3-1  a
1      3  2013-3-1  b
2      4  2013-3-1  c
3      5       nan  d
4      6  2013-3-1  d

In [13]: data.dtypes
Out[13]: 
value     int64
date     object
id       object

In [14]: pd.to_datetime(data['date'])
Out[14]: 
0    2013-3-1
1    2013-3-1
2    2013-3-1
3         nan
4    2013-3-1
Name: date

Is there a way to make read_csv's parse_dates work with columns that contain missing values? I.e., assign NaN to the missing values and still parse the valid dates?

Recommended answer

This is currently a small bug in the parser; see https://github.com/pydata/pandas/issues/3062. An easy workaround is to force-convert the column after you read it in. The conversion fills the NaN entries with NaT, the Not-a-Time marker, which is the datetime equivalent of NaN. This should work on 0.10.1:

In [22]: df
Out[22]: 
   value      date id
0      2  2013-3-1  a
1      3  2013-3-1  b
2      4  2013-3-1  c
3      5       NaN  d
4      6  2013-3-1  d

In [23]: df.dtypes
Out[23]: 
value     int64
date     object
id       object
dtype: object

In [24]: pd.to_datetime(df['date'])
Out[24]: 
0   2013-03-01 00:00:00
1   2013-03-01 00:00:00
2   2013-03-01 00:00:00
3                   NaT
4   2013-03-01 00:00:00
Name: date, dtype: datetime64[ns]
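
For readers who want to try the workaround end to end, here is a minimal sketch. The CSV text is a hypothetical stand-in for test.csv (the original file is not shown in the question), and it assumes a reasonably recent pandas where io.StringIO can be passed to read_csv. The column is read without parse_dates and converted afterwards.

```python
import io

import pandas as pd

# Hypothetical stand-in for test.csv; the fourth row has an empty date field.
csv_text = """value,date,id
2,2013-3-1,a
3,2013-3-1,b
4,2013-3-1,c
5,,d
6,2013-3-1,d
"""

# Read without parse_dates, so the date column holds strings plus NaN.
df = pd.read_csv(io.StringIO(csv_text))

# Force-convert after the read; the missing entry becomes NaT.
df["date"] = pd.to_datetime(df["date"])

# isnull() now flags the missing row.
missing = df["date"].isnull()
print(missing.tolist())
```

After the conversion the column has a datetime dtype, so isnull() picks up the empty row as the question asked.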

If the string 'nan' actually appears in your data, you can do this:

In [31]: s = pd.Series(['2013-1-1','2013-1-1','nan','2013-1-1'])

In [32]: s
Out[32]: 
0    2013-1-1
1    2013-1-1
2         nan
3    2013-1-1
dtype: object

In [39]: s[s=='nan'] = np.nan

In [40]: s
Out[40]: 
0    2013-1-1
1    2013-1-1
2         NaN
3    2013-1-1
dtype: object

In [41]: pd.to_datetime(s)
Out[41]: 
0   2013-01-01 00:00:00
1   2013-01-01 00:00:00
2                   NaT
3   2013-01-01 00:00:00
dtype: datetime64[ns]
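
The same cleanup can be written as a standalone script. This is only a sketch: it swaps in Series.mask for the in-place boolean assignment, so the original Series is left untouched, and the variable names cleaned and dates are made up for illustration.

```python
import pandas as pd

# Same sample data as the session above, with a literal 'nan' string.
s = pd.Series(["2013-1-1", "2013-1-1", "nan", "2013-1-1"])

# mask() replaces entries where the condition is True with a missing value,
# returning a new Series instead of modifying s.
cleaned = s.mask(s == "nan")

# The missing entry converts to NaT; the valid strings become timestamps.
dates = pd.to_datetime(cleaned)

print(dates.isnull().tolist())
```

Keeping the cleanup non-mutating makes it easy to compare the raw and converted columns side by side when checking that nothing else was lost.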
