Is this a Pandas bug with notnull() or a fundamental misunderstanding on my part (probably misunderstanding)

Question

I have a pandas dataframe with two columns and default indexing. The first column is a string and the second is a date. The top date is NaN (though it should be NaT really).

index    somestr    date
0        ON         NaN
1        1C         2014-06-11 00:00:00
2        2C         2014-07-09 00:00:00
3        3C         2014-08-13 00:00:00
4        4C         2014-09-10 00:00:00
5        5C         2014-10-08 00:00:00
6        6C         2014-11-12 00:00:00
7        7C         2014-12-10 00:00:00
8        8C         2015-01-14 00:00:00
9        9C         2015-02-11 00:00:00
10       10C        2015-03-11 00:00:00
11       11C        2015-04-08 00:00:00
12       12C        2015-05-13 00:00:00

Call this dataframe df.

When I run:

df[pd.notnull(df['date'])]

I expect the first row to go away. It doesn't. If I remove the column with string by setting:

df=df[['date']]

and then apply:

df[pd.notnull(df['date'])]

then the first row with the null does go away.

Also, the row with the null always goes away if all columns are number/date types. When a column with a string appears, this problem occurs.

Surely this is a bug, right? I am not sure if others will be able to replicate this. This was on my Enthought Canopy for Windows (I am not smart enough for UNIX/Linux command line noise)

Per requests below from Jeff and unutbu: @ubuntu -

df.dtypes
somestr    object
date       object
dtype:  object

And also:

type(df.iloc[0]['date'])
pandas.tslib.NaTType

In the code this column was specifically assigned pd.NaT. I also do not understand why it says NaN when it should say NaT. The filtering I used worked fine when I used this toy frame:

df=pd.DataFrame({'somestr' : ['aa', 'bb'], 'date' : [pd.NaT, dt.datetime(2014,4,15)]}, columns=['somestr', 'date'])
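
One thing that may explain why the toy frame behaves: built this way, its date column is probably not object dtype at all. Here is a minimal sketch to check that, assuming a recent pandas release (I have not verified what 0.12 infers here). The names toy, is_datetime, mask and filtered are just illustrative.

```python
import datetime as dt

import pandas as pd

# Same toy frame as above: NaT and a datetime passed directly to the constructor.
toy = pd.DataFrame(
    {"somestr": ["aa", "bb"], "date": [pd.NaT, dt.datetime(2014, 4, 15)]},
    columns=["somestr", "date"],
)

# Inspect the inferred dtypes; on a recent pandas the date column
# should come out as a datetime64 dtype rather than object.
print(toy.dtypes)
is_datetime = pd.api.types.is_datetime64_any_dtype(toy["date"])

# With a real datetime column, notnull() flags the NaT row as missing.
mask = pd.notnull(toy["date"])
filtered = toy[mask]
print(filtered)
```

If df.dtypes on the real frame shows object for date while this one shows a datetime dtype, the two frames are not really comparable, which would fit the symptoms described.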

It should also be noted that although the table above had NaN in the output, the following output NaT:

df['date'][0]
NaT

And also:

pd.notnull(df['date'][0])
False

pd.notnull(df['date'][1])
True

but....when evaluating the array, they all came back True - bizarre...

np.all(pd.notnull(df['date']))
True

@Jeff - this is 0.12. I am stuck with this. The frame was created by concatenating two different frames that were grabbed from database queries using psql. The date and some other float columns were then added by calculations I did. Of course, I filtered to the two relevant columns that made sense here until I pinpointed that the string valued columns were causing problems.

************ How to replicate **********

import pandas as pd
import datetime as dt

print(pd.__version__)
# 0.12.0

df = pd.DataFrame({'somestr': ['aa', 'bb'], 'date': ['cc', 'dd']},
                  columns=['somestr', 'date'])
df['date'].iloc[0] = pd.NaT
df['date'].iloc[1] = pd.to_datetime(dt.datetime(2014, 4, 15))
print(df[pd.notnull(df['date'])])
#   somestr                 date
# 0      aa                  NaN
# 1      bb  2014-04-15 00:00:00

df2 = df[['date']]
print(df2[pd.notnull(df2['date'])])
#                  date
# 1 2014-04-15 00:00:00

So, this dataframe originally had all string entries - then the date column was converted to dates with an NaT at the top - note that in the table it is NaN, but when using df.iloc[0]['date'] you do see the NaT. Using the snippet above, you can see that the filtering by not null is bizarre with and without the somestr column. Again - this is Enthought Canopy for Windows with Pandas 0.12 and NumPy 1.8.
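
Since the date column ends up as object dtype, one possible workaround is to convert it to a real datetime column before filtering. The sketch below assumes a recent pandas release, so it is not a confirmed fix for 0.12. It builds the object column directly instead of using chained iloc assignment, and the names fixed and clean are hypothetical.

```python
import datetime as dt

import pandas as pd

# An object-dtype date column holding NaT and a Timestamp,
# similar in spirit to the frame in the snippet above.
df = pd.DataFrame(
    {
        "somestr": ["aa", "bb"],
        "date": pd.Series(
            [pd.NaT, pd.Timestamp(dt.datetime(2014, 4, 15))], dtype=object
        ),
    },
    columns=["somestr", "date"],
)
print(df.dtypes)  # date is object here

# Convert to a proper datetime column; the missing entry stays NaT.
fixed = df.assign(date=pd.to_datetime(df["date"]))
print(fixed.dtypes)

# Filtering on the converted column drops the NaT row.
clean = fixed[pd.notnull(fixed["date"])]
print(clean)
```

Working on a converted copy (fixed) leaves the original df untouched, so the two can be compared side by side.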

Recommended answer

I encountered this problem also. Here's how I fixed it. isnull() is a function that checks whether a value is missing (NaN, None, or NaT). The "~" (tilde) operator negates the expression that follows it. So we are saying: give me a dataframe from your original dataframe, but only the rows where the 'date' column is NOT null.

df = df[~df['date'].isnull()]

Hope this helps!
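
For reference, here is a small sketch of what isnull() treats as missing, plus two other spellings of the same filter. It assumes a recent pandas release (Series.notna is not available in old versions such as 0.12), and the names values, frame, a, b and c are only illustrative.

```python
import numpy as np
import pandas as pd

# isnull() flags NaN, None and NaT as missing, but not an empty string.
values = pd.Series([np.nan, None, pd.NaT, "", "x"], dtype=object)
missing = pd.isnull(values).tolist()
print(missing)

# A small frame with a real datetime column and one missing date.
frame = pd.DataFrame(
    {"somestr": ["aa", "bb"], "date": pd.to_datetime([None, "2014-04-15"])}
)

a = frame[~frame["date"].isnull()]  # the form used in this answer
b = frame[frame["date"].notna()]    # same mask without the negation
c = frame.dropna(subset=["date"])   # drop rows where 'date' is missing
print(a)
```

All three should give the same rows, so which one to use is mostly a matter of taste.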
