How to properly handle datetime comparisons in an entire DataFrame with NaT values?
Problem description
I stumbled upon this odd behavior when trying to check whether a DataFrame has values above a certain date, while that DataFrame may also contain pd.NaT.
Comparisons of scalar values behave as expected:
import pandas as pd
pd.NaT > pd.to_datetime('2018-10-15')
# False
Comparisons with a Series also behave as expected:
s = pd.Series([pd.NaT, pd.to_datetime('2018-10-16')])
s > pd.to_datetime('2018-10-15')
#0 False
#1 True
#dtype: bool
But the DataFrame comparison is incorrect:
s.to_frame() > pd.to_datetime('2018-10-15')
#      0
#0  True
#1  True
It seems to me the issue is that the comparison initially returns NaN, which is (at some point?) coerced to True, given the following behavior:
df = pd.DataFrame([[pd.NaT, pd.to_datetime('2018-10-16')],
                   [pd.to_datetime('2018-10-16'), pd.NaT]])
df >= pd.to_datetime('2018-10-15')
#      0     1
#0  True  True
#1  True  True
df.ge(pd.to_datetime('2018-10-15'))
#     0    1
#0  NaN  1.0
#1  1.0  NaN
So can we really not use the > < >= <= operators when comparing against a DataFrame, and instead need to rely on .lt .gt .le .ge followed by .fillna(0)?
df.ge(pd.to_datetime('2018-10-15')).fillna(0)
#     0    1
#0  0.0  1.0
#1  1.0  0.0
Recommended answer
This was a bug that will be fixed in the next release of pandas (0.24.0):
In [1]: import pandas as pd; pd.__version__
Out[1]: '0.24.0.dev0+1504.g9642fea9c'
In [2]: s = pd.Series([pd.NaT, pd.to_datetime('2018-10-16')])
In [3]: s > pd.to_datetime('2018-10-15')
Out[3]:
0 False
1 True
dtype: bool
In [4]: s.to_frame() > pd.to_datetime('2018-10-15')
Out[4]:
0
0 False
1 True
In [5]: df = pd.DataFrame([[pd.NaT, pd.to_datetime('2018-10-16')],
...: [pd.to_datetime('2018-10-16'), pd.NaT]])
...:
In [6]: df >= pd.to_datetime('2018-10-15')
Out[6]:
0 1
0 False True
1 True False
In [7]: df.ge(pd.to_datetime('2018-10-15'))
Out[7]:
0 1
0 False True
1 True False
For the corresponding GitHub issue, see: https://github.com/pandas-dev/pandas/issues/22242
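If code has to run on a pandas version that may still have this bug, one possible workaround (a sketch that is not part of the original answer) is to mask the comparison with DataFrame.notna(), so that a NaT cell can never count as a match regardless of what the raw comparison returns for it. The names cutoff and result are only illustrative.

```python
import pandas as pd

cutoff = pd.to_datetime('2018-10-15')
df = pd.DataFrame([[pd.NaT, pd.to_datetime('2018-10-16')],
                   [pd.to_datetime('2018-10-16'), pd.NaT]])

# Combine the comparison with a "not missing" mask so that NaT cells
# are always False in the result, whatever the comparison yields for them.
result = (df >= cutoff) & df.notna()
print(result)
```

On a fixed version the mask is redundant but harmless, since NaT already compares as False there. The result is a boolean DataFrame, so no .fillna(0) step is needed.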