Dealing with None values when using Pandas Groupby and Apply with a Function
Problem description
I have a DataFrame in Pandas with a letter and two dates as columns. I would like to calculate the business days between the two date columns for the previous row using shift(), provided that the Letter value is the same (using a .groupby()). I was doing this with .apply(). This worked until I passed in some data in which one of the dates was missing. I moved everything to a function to handle the missing value with a try/except clause, but now my function returns NaN for everything. It appears the None value for the date is impacting each call of the function, whereas I would think it would only do so when the Letter from the .groupby() is A.
import pandas as pd
from datetime import datetime
import numpy as np

def business_days(x):
    try:
        return pd.DataFrame(np.busday_count(x['First Date'].tolist(), x['Last Date'].tolist())).shift().reset_index(drop=True)
    except ValueError:
        return None

df = pd.DataFrame(data=[['A', datetime(2016, 1, 7), None],
                        ['A', datetime(2016, 3, 1), datetime(2016, 3, 8)],
                        ['B', datetime(2016, 5, 1), datetime(2016, 5, 10)],
                        ['B', datetime(2016, 6, 5), datetime(2016, 6, 7)]],
                  columns=['Letter', 'First Date', 'Last Date'])
df['First Date'] = df['First Date'].apply(lambda x: x.to_datetime().date())
df['Last Date'] = df['Last Date'].apply(lambda x: x.to_datetime().date())
df['Gap'] = df.groupby('Letter').apply(business_days)
print(df)
Actual output:
Letter First Date Last Date Gap
0 A 2016-01-07 NaT NaN
1 A 2016-03-01 2016-03-08 NaN
2 B 2016-05-01 2016-05-10 NaN
3 B 2016-06-05 2016-06-07 NaN
Desired output:
Letter First Day Last Day Gap
0 A 2016-01-07 NAT NAN
1 A 2016-03-01 2016-03-08 NAN
2 B 2016-05-01 2016-05-10 NAN
3 B 2016-06-05 2016-06-07 7
Recommended answer
Ignoring the NaTs for the moment, note that the np.busday_count calculation can be done on whole columns of df before applying groupby. This will save time since this replaces many calls to np.busday_count (one for each group) with a single call to np.busday_count. One function call applied to a large array is generally faster than many function calls on small arrays.
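To make the "one call on whole columns" point concrete, here is a minimal sketch. The sample dates and the names starts, ends, vectorized and looped are made up for illustration; only np.busday_count comes from the answer.

```python
import numpy as np

# Hypothetical sample dates, chosen only for illustration.
starts = np.array(['2016-03-01', '2016-05-01', '2016-06-05'], dtype='datetime64[D]')
ends = np.array(['2016-03-08', '2016-05-10', '2016-06-07'], dtype='datetime64[D]')

# One vectorized call over the whole arrays.
vectorized = np.busday_count(starts, ends)

# The same counts computed one pair at a time, shown only for comparison.
looped = np.array([np.busday_count(s, e) for s, e in zip(starts, ends)])

print(vectorized)
print(looped)
```

Both approaches give the same counts; the vectorized form simply avoids the per-call overhead of the Python loop.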
To handle the NaTs, you could use pd.notnull to identify the rows which have NaTs and mask the First Dates and Last Dates so that only valid dates are sent to np.busday_count. You can then fill in NaNs for those rows where the dates had NaTs.
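As a small illustration of why the mask is needed, the sketch below assumes that np.busday_count raises a ValueError when it is handed a NaT, which is what I would expect from NumPy but is worth checking against your installed version. The two-row sample and the names start, end, raised and mask are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row sample; the second row has a missing end date.
start = pd.Series(pd.to_datetime(['2016-03-01', '2016-01-07']))
end = pd.Series(pd.to_datetime(['2016-03-08', None]))

# Passing a NaT straight through is expected to fail (assumption: ValueError).
try:
    np.busday_count(start.values.astype('datetime64[D]'),
                    end.values.astype('datetime64[D]'))
    raised = False
except ValueError:
    raised = True

# Boolean mask: True only where both dates are present.
mask = (pd.notnull(start) & pd.notnull(end)).to_numpy()

print(raised)
print(mask)
```

Only the rows where mask is True should be passed to np.busday_count; the remaining rows are filled with NaN afterwards.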
After we calculate all the business day counts, all we need to do is group by Letter and shift the values down by one. That can be done with groupby/transform('shift').

import datetime as DT
import numpy as np
import pandas as pd

def business_days(start, end):
    mask = pd.notnull(start) & pd.notnull(end)
    start = start.values.astype('datetime64[D]')[mask]
    end = end.values.astype('datetime64[D]')[mask]
    result = np.empty(len(mask), dtype=float)
    result[mask] = np.busday_count(start, end)
    result[~mask] = np.nan
    return result

df = pd.DataFrame(data=[['A', DT.datetime(2016, 1, 7), None],
                        ['A', DT.datetime(2016, 3, 1), DT.datetime(2016, 3, 8)],
                        ['B', DT.datetime(2016, 5, 1), DT.datetime(2016, 5, 10)],
                        ['B', DT.datetime(2016, 6, 5), DT.datetime(2016, 6, 7)]],
                  columns=['Letter', 'First Date', 'Last Date'])

df['Gap'] = business_days(df['First Date'], df['Last Date'])
print(df)
# Letter First Date Last Date Gap
# 0 A 2016-01-07 NaT NaN
# 1 A 2016-03-01 2016-03-08 5.0
# 2 B 2016-05-01 2016-05-10 6.0
# 3 B 2016-06-05 2016-06-07 1.0

df['Gap'] = df.groupby('Letter')['Gap'].transform('shift')
print(df)
prints
Letter First Date Last Date Gap
0 A 2016-01-07 NaT NaN
1 A 2016-03-01 2016-03-08 NaN
2 B 2016-05-01 2016-05-10 NaN
3 B 2016-06-05 2016-06-07 6.0
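For the per-group shift itself, a minimal sketch on a toy frame may help. The frame toy and the names via_transform and via_shift are illustrative only; calling .shift() directly on the grouped column is an alternative spelling that, as far as I know, gives the same result as transform('shift').

```python
import numpy as np
import pandas as pd

# Toy frame holding the unshifted business-day counts from the answer.
toy = pd.DataFrame({'Letter': ['A', 'A', 'B', 'B'],
                    'Gap': [np.nan, 5.0, 6.0, 1.0]})

# Shift within each Letter group, so values never cross group boundaries.
via_transform = toy.groupby('Letter')['Gap'].transform('shift')

# Alternative spelling: the grouped column also exposes shift() directly.
via_shift = toy.groupby('Letter')['Gap'].shift()

print(via_transform)
```

The first row of each group becomes NaN because there is no previous row within that group to take a value from.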