Max / Min of date column in Pandas, columns include NaN values
I'm trying to create a new column in a pandas DataFrame holding the maximum (or minimum) date from two other date columns. But when there is a NaN anywhere in either of those columns, the whole min/max column becomes NaN. What gives? With numeric columns this works fine, but with dates the new column is all NaN. Here's some sample code to illustrate the problem:
import numpy as np
import pandas as pd
from datetime import date

df = pd.DataFrame(data=[[np.nan, date(2000,11,1)],
                        [date(2000,12,1), date(2000,9,1)],
                        [date(2000,4,1), np.nan],
                        [date(2000,12,2), np.nan]], columns=['col1','col2'])
df['col3'] = df[['col1','col2']].max(axis=1)
I know it can be done with loc and a combination of <, >, isnull and so on. But how can I make it work with the regular max/min functions?
You're storing date objects in your columns. If you convert them to datetime, it works as expected:
In[10]:
df['col1'] = pd.to_datetime(df['col1'])
df['col2'] = pd.to_datetime(df['col2'])
df
Out[10]:
        col1       col2 col3
0        NaT 2000-11-01  NaN
1 2000-12-01 2000-09-01  NaN
2 2000-04-01        NaT  NaN
3 2000-12-02        NaT  NaN
In[11]:
df['col3'] = df[['col1','col2']].max(axis=1)
df
Out[11]:
        col1       col2       col3
0        NaT 2000-11-01 2000-11-01
1 2000-12-01 2000-09-01 2000-12-01
2 2000-04-01        NaT 2000-04-01
3 2000-12-02        NaT 2000-12-02
If you simply did:
df['col3'] = df['col1'].max()
this raises a TypeError: '>=' not supported between instances of 'float' and 'datetime.date'
The NaN values are floats, so a column that contains them holds a mix of float and datetime.date objects (object dtype), which cannot be compared with each other; that is why the row-wise max returns NaN. If you had no missing values it would work as expected. If you do have missing values, convert the columns to datetime so that the missing values become NaT and max works correctly.
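
Putting the pieces together, here is a minimal self-contained sketch of the approach described above, using the sample data from the question. The output column names `latest` and `earliest` are illustrative choices, not part of the original post, and the loop over `date_cols` is just one way to apply `pd.to_datetime` to several columns.

```python
from datetime import date

import numpy as np
import pandas as pd

# Sample data from the question: datetime.date objects mixed with float NaN.
df = pd.DataFrame(
    data=[
        [np.nan, date(2000, 11, 1)],
        [date(2000, 12, 1), date(2000, 9, 1)],
        [date(2000, 4, 1), np.nan],
        [date(2000, 12, 2), np.nan],
    ],
    columns=["col1", "col2"],
)

# Convert each column to a datetime dtype; missing values become NaT.
date_cols = ["col1", "col2"]
for col in date_cols:
    df[col] = pd.to_datetime(df[col])

# Row-wise max/min now skip NaT instead of returning NaN for every row.
# "latest" and "earliest" are hypothetical column names for illustration.
df["latest"] = df[date_cols].max(axis=1)
df["earliest"] = df[date_cols].min(axis=1)

print(df)
```

Because the reductions skip missing values by default, a row with one NaT simply returns the other date; a row where both dates are missing would be expected to yield NaT.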