How to apply rolling functions in a group by object in pandas
Problem description
I'm having difficulty solving a look-back (rolling window) problem on a DataFrame, or perhaps on a groupby object.
The following is a simple example of the dataframe I have:
fruit amount
20140101 apple 3
20140102 apple 5
20140102 orange 10
20140104 banana 2
20140104 apple 10
20140104 orange 4
20140105 orange 6
20140105 grape 1
…
20141231 apple 3
20141231 grape 2
For every day, I need to calculate the average 'amount' of each fruit over the previous 3 days, and create the following data frame:
fruit average_in_last 3 days
20140104 apple 4
20140104 orange 10
...
For example on 20140104, the previous 3 days are 20140101, 20140102 and 20140103 (note the date in the data frame is not continuous and 20140103 does not exist), the average amount of apple is (3+5)/2 = 4 and orange is 10/1=10, the rest is 0.
The sample data frame is very simple but the actual data frame is much more complicated and larger. Hope someone can shed some light on this, thank you in advance!
Solution
Assuming we start with a data frame like this, indexed by date:
>>> df
fruit amount
2017-06-01 apple 1
2017-06-03 apple 16
2017-06-04 apple 12
2017-06-05 apple 8
2017-06-06 apple 14
2017-06-08 apple 1
2017-06-09 apple 4
2017-06-02 orange 13
2017-06-03 orange 9
2017-06-04 orange 9
2017-06-05 orange 2
2017-06-06 orange 11
2017-06-07 orange 6
2017-06-08 orange 3
2017-06-09 orange 3
2017-06-10 orange 13
2017-06-02 grape 14
2017-06-03 grape 16
2017-06-07 grape 4
2017-06-09 grape 15
2017-06-10 grape 5
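To make the steps that follow reproducible, the frame above could be rebuilt along these lines. This is only a sketch: it assumes the index holds plain `datetime.date` objects (which is what the `dates` list below is built from) and that the index is unnamed; the `rows` dictionary is a helper introduced here for illustration.

```python
import datetime as dt

import pandas as pd

# (day of June 2017, amount) pairs copied from the frame shown above.
rows = {
    "apple": [(1, 1), (3, 16), (4, 12), (5, 8), (6, 14), (8, 1), (9, 4)],
    "orange": [(2, 13), (3, 9), (4, 9), (5, 2), (6, 11),
               (7, 6), (8, 3), (9, 3), (10, 13)],
    "grape": [(2, 14), (3, 16), (7, 4), (9, 15), (10, 5)],
}

index, fruit, amount = [], [], []
for name, pairs in rows.items():
    for day, value in pairs:
        index.append(dt.date(2017, 6, day))  # assumption: date objects, not Timestamps
        fruit.append(name)
        amount.append(value)

# Unnamed date index, with 'fruit' and 'amount' as columns.
df = pd.DataFrame({"fruit": fruit, "amount": amount}, index=index)
```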
>>> import pandas as pd
>>> dates = [i.date() for i in pd.date_range('2017-06-01', '2017-06-10')]
>>> temp = (df.groupby('fruit')['amount']
...           .apply(lambda x: x.reindex(dates)  # fill in the missing dates for each group
...                             .fillna(0)       # treat each missing date as 0
...                             .rolling(3)
...                             .sum())          # rolling sum over 3 consecutive days
...           .reset_index()
...           .rename(columns={'amount': 'sum_of_3_days',
...                            'level_1': 'date'}))  # name the date level 'date'
>>> temp.head()
fruit date sum_of_3_days
0 apple 2017-06-01 NaN
1 apple 2017-06-02 NaN
2 apple 2017-06-03 17.0
3 apple 2017-06-04 28.0
4 apple 2017-06-05 36.0
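The two leading NaN values appear because an integer window such as `rolling(3)` by default requires three observations before it emits a result. If partial windows are acceptable, `min_periods` can relax that. The small sketch below uses the first four apple values after zero-filling (1, 0, 16, 12) purely as an illustration.

```python
import pandas as pd

# Apple amounts for 2017-06-01 .. 2017-06-04 after reindexing and fillna(0).
s = pd.Series([1.0, 0.0, 16.0, 12.0])

# Default: the window must hold 3 values, so the first two results are NaN.
strict = s.rolling(3).sum()

# min_periods=1: a result is produced as soon as one value is available.
relaxed = s.rolling(3, min_periods=1).sum()

print(strict.tolist())   # [nan, nan, 17.0, 28.0]
print(relaxed.tolist())  # [1.0, 1.0, 17.0, 28.0]
```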
# convert the date index into a date column, then attach the rolling sums
>>> df = df.reset_index().rename(columns={'index': 'date'})
>>> df = df.merge(temp, on=['fruit', 'date'])
>>> df
date fruit amount sum_of_3_days
0 2017-06-01 apple 1 NaN
1 2017-06-03 apple 16 17.0
2 2017-06-04 apple 12 28.0
3 2017-06-05 apple 8 36.0
4 2017-06-06 apple 14 34.0
5 2017-06-08 apple 1 15.0
6 2017-06-09 apple 4 5.0
7 2017-06-02 orange 13 NaN
8 2017-06-03 orange 9 22.0
9 2017-06-04 orange 9 31.0
10 2017-06-05 orange 2 20.0
11 2017-06-06 orange 11 22.0
12 2017-06-07 orange 6 19.0
13 2017-06-08 orange 3 20.0
14 2017-06-09 orange 3 12.0
15 2017-06-10 orange 13 19.0
16 2017-06-02 grape 14 NaN
17 2017-06-03 grape 16 30.0
18 2017-06-07 grape 4 4.0
19 2017-06-09 grape 15 19.0
20 2017-06-10 grape 5 20.0
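Note that the result above is a rolling sum that includes the current day, whereas the question asks for the mean over the three days before each date. One possible way to get closer to that definition is a time-based window with `closed='left'`, which excludes the current day and needs no reindexing. The sketch below is not part of the original answer; it uses the sample rows from the question, assumes the dates are parsed into a datetime column, and the column name `average_in_last_3_days` is chosen here for illustration.

```python
import pandas as pd

# Sample rows from the question (the elided rows are left out).
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["20140101", "20140102", "20140102", "20140104",
             "20140104", "20140104", "20140105", "20140105"],
            format="%Y%m%d",
        ),
        "fruit": ["apple", "apple", "orange", "banana",
                  "apple", "orange", "orange", "grape"],
        "amount": [3, 5, 10, 2, 10, 4, 6, 1],
    }
)

avg = (
    df.sort_values("date")             # time-based windows need ordered dates
      .set_index("date")
      .groupby("fruit")["amount"]
      .rolling("3D", closed="left")    # window [t - 3 days, t): current day excluded
      .mean()
      .rename("average_in_last_3_days")
      .reset_index()
)

print(avg)
```

Dates with no observations in the preceding three days come out as NaN; chaining `.fillna(0)` would match the "the rest is 0" convention from the question.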