大 pandas 时间累计总和按组 [英] Pandas temporal cumulative sum by group
问题描述
我有一个数据框,每个id记录一个或多个事件。对于每个事件,记录ID,度量x和日期。像这样:
I have a data frame where 1 or more events are recorded for each id. For each event the id, a metric x and a date are recorded. Something like this:
import pandas as pd
import datetime as dt
import numpy as np
x = range(0, 6)
id = ['a', 'a', 'b', 'a', 'b', 'b']
dates = [dt.datetime(2012, 5, 2),dt.datetime(2012, 4, 2),dt.datetime(2012, 6, 2),
dt.datetime(2012, 7, 30),dt.datetime(2012, 4, 1),dt.datetime(2012, 5, 9)]
df =pd.DataFrame(np.column_stack((id,x,dates)), columns = ['id', 'x', 'dates'])
我希望能够设置回头周期(即70天),并为数据集中的每一行计算该id的任何先前事件和期望回溯(不包括计算执行所在的行的x)的x的累积和。
应该看起来像:
I'd like to be able to set a lookback period (i.e. 70 days) and calculate, for each row in the dataset, a cumulative sum of x for any preceding event for that id and within the desired lookback (excluding x for the row the calculation is being performed for). Should end up looking like:
id x dates want
0 a 0 2012-05-02 00:00:00 1
1 a 1 2012-04-02 00:00:00 0
2 b 2 2012-06-02 00:00:00 9
3 a 3 2012-07-30 00:00:00 0
4 b 4 2012-04-01 00:00:00 0
5 b 5 2012-05-09 00:00:00 4
推荐答案
好了,一种方法如下:(1) c $ c> groupby / apply 与'id'作为分组变量。 (2)在适用范围内,将 resample
组添加到日常时间序列中。 (3)然后使用 rolling_sum
(并且移动,以便不包含当前行的'x'值)来计算70天回顾期的总和。 (4)将组退回到原始观察值:
Well, one approach is the following: (1) do a groupby/apply
with 'id' as grouping variable. (2) Within the apply, resample
the group to a daily time series. (3) Then just using rolling_sum
(and shift so you don't include the current rows 'x' value) to compute the sum of your 70 day lookback periods. (4) Reduce the group back to only the original observations:
In [12]: df = df.sort(['id','dates'])
In [13]: df
Out[13]:
id x dates
1 a 1 2012-04-02
0 a 0 2012-05-02
3 a 3 2012-07-30
4 b 4 2012-04-01
5 b 5 2012-05-09
2 b 2 2012-06-02
您将需要按排序的数据, [ 'ID', '日期']
。现在我们可以执行 groupby / apply
:
You are going to need your data sorted by ['id','dates']
. Now we can do the groupby/apply
:
In [15]: def past70(g):
g = g.set_index('dates').resample('D','last')
g['want'] = pd.rolling_sum(g['x'],70,0).shift(1)
return g[g.x.notnull()]
In [16]: df = df.groupby('id').apply(past70).drop('id',axis=1)
In [17]: df
Out[17]:
x want
id dates
a 2012-04-02 1 NaN
2012-05-02 0 1
2012-07-30 3 0
b 2012-04-01 4 NaN
2012-05-09 5 4
2012-06-02 2 9
如果您不想使用NaN,那么请执行以下操作:
If you don't want the NaNs then just do:
In [28]: df.fillna(0)
Out[28]:
x want
id dates
a 2012-04-02 1 0
2012-05-02 0 1
2012-07-30 3 0
b 2012-04-01 4 0
2012-05-09 5 4
2012-06-02 2 9
编辑:如果您想让回溯窗口成为参数如下所示:
If you want to make the lookback window a parameter do something like the following:
def past_window(g,win=70):
g = g.set_index('dates').resample('D','last')
g['want'] = pd.rolling_sum(g['x'],win,0).shift(1)
return g[g.x.notnull()]
df = df.groupby('id').apply(past_window,win=10)
print df.fillna(0)
这篇关于大 pandas 时间累计总和按组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!