How to apply rolling functions in a group by object in pandas
Problem description
I'm having difficulty solving a look-back or roll-over problem in a dataframe, or perhaps in a groupby.
The following is a simple example of the dataframe I have:
          fruit   amount
20140101  apple   3
20140102  apple   5
20140102  orange  10
20140104  banana  2
20140104  apple   10
20140104  orange  4
20140105  orange  6
20140105  grape   1
…
20141231  apple   3
20141231  grape   2
I need to calculate, for every day, the average 'amount' of each fruit over the previous 3 days, and create the following data frame:
          fruit   average_in_last 3 days
20140104  apple   4
20140104  orange  10
...
For example, on 20140104 the previous 3 days are 20140101, 20140102 and 20140103 (note that the dates in the data frame are not continuous and 20140103 does not exist), so the average amount for apple is (3+5)/2 = 4, for orange it is 10/1 = 10, and for the rest it is 0.
The sample data frame is very simple, but the actual data frame is much larger and more complicated. I hope someone can shed some light on this. Thank you in advance!
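To make the requested calculation concrete, here is a minimal sketch that reproduces the arithmetic above on the visible sample rows. It assumes the index can be parsed into real datetimes, and it uses a time-based window with `closed="left"` so that the current day is excluded. The column name `avg_prev_3d` is made up for illustration, and the rows elided by "…" are simply left out.

```python
import pandas as pd

# Sample rows from the question (the elided middle of the table is left out).
df = pd.DataFrame(
    {
        "date": ["20140101", "20140102", "20140102", "20140104",
                 "20140104", "20140104", "20140105", "20140105"],
        "fruit": ["apple", "apple", "orange", "banana",
                  "apple", "orange", "orange", "grape"],
        "amount": [3, 5, 10, 2, 10, 4, 6, 1],
    }
)
df["date"] = pd.to_datetime(df["date"], format="%Y%m%d")
df = df.set_index("date").sort_index()

# For each fruit, average the amounts seen in the 3 calendar days before
# each row. closed="left" makes the window [t - 3 days, t), so the current
# day itself is not part of the average.
df["avg_prev_3d"] = (
    df.groupby("fruit")["amount"]
      .transform(lambda s: s.rolling("3D", closed="left").mean())
      .fillna(0)  # no observations in the window -> 0, as the question asks
)

# Rows for 2014-01-04, the day used as the example in the question.
print(df[df.index == pd.Timestamp("2014-01-04")])
```

Because the window is defined in calendar days rather than in rows, the missing 20140103 needs no special handling in this sketch.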
Assuming we start with a data frame like this:
>>> df
            fruit   amount
2017-06-01  apple   1
2017-06-03  apple   16
2017-06-04  apple   12
2017-06-05  apple   8
2017-06-06  apple   14
2017-06-08  apple   1
2017-06-09  apple   4
2017-06-02  orange  13
2017-06-03  orange  9
2017-06-04  orange  9
2017-06-05  orange  2
2017-06-06  orange  11
2017-06-07  orange  6
2017-06-08  orange  3
2017-06-09  orange  3
2017-06-10  orange  13
2017-06-02  grape   14
2017-06-03  grape   16
2017-06-07  grape   4
2017-06-09  grape   15
2017-06-10  grape   5
>>> import pandas as pd
>>> dates = [i.date() for i in pd.date_range('2017-06-01', '2017-06-10')]
>>> temp = (df.groupby('fruit')['amount']
...           .apply(lambda x: x.reindex(dates)  # fill in the missing dates for each group
...                             .fillna(0)       # fill each missing date with 0
...                             .rolling(3)
...                             .sum())          # do a rolling sum over 3 rows (days)
...           .reset_index()
...           .rename(columns={'amount': 'sum_of_3_days',
...                            'level_1': 'date'}))  # rename date index to date col
>>> temp.head()
   fruit  date        sum_of_3_days
0  apple  2017-06-01  NaN
1  apple  2017-06-02  NaN
2  apple  2017-06-03  17.0
3  apple  2017-06-04  28.0
4  apple  2017-06-05  36.0
# convert the date index into a date column
>>> df = df.reset_index().rename(columns={'index': 'date'})
>>> df = df.merge(temp, on=['fruit', 'date'])  # assign the result; merge does not modify df in place
>>> df
    date        fruit   amount  sum_of_3_days
0   2017-06-01  apple   1       NaN
1   2017-06-03  apple   16      17.0
2   2017-06-04  apple   12      28.0
3   2017-06-05  apple   8       36.0
4   2017-06-06  apple   14      34.0
5   2017-06-08  apple   1       15.0
6   2017-06-09  apple   4       5.0
7   2017-06-02  orange  13      NaN
8   2017-06-03  orange  9       22.0
9   2017-06-04  orange  9       31.0
10  2017-06-05  orange  2       20.0
11  2017-06-06  orange  11      22.0
12  2017-06-07  orange  6       19.0
13  2017-06-08  orange  3       20.0
14  2017-06-09  orange  3       12.0
15  2017-06-10  orange  13      19.0
16  2017-06-02  grape   14      NaN
17  2017-06-03  grape   16      30.0
18  2017-06-07  grape   4       4.0
19  2017-06-09  grape   15      19.0
20  2017-06-10  grape   5       20.0
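For anyone who wants to run the steps above end to end, the following is a self-contained sketch of the same idea: reindex each fruit onto a gap-free daily range, fill the gaps with 0, take a 3-row rolling sum, then merge back. It is a variant rather than the exact code above. It assumes the dates are stored as a `date` column of pandas timestamps, and the helper `rolling_sum` and the names `raw`, `all_days` and `result` are made up for illustration. Note that this window includes the current day and produces a sum, not the mean over the previous days that the question asks for.

```python
import pandas as pd

# Sample data from the answer, keyed by fruit and then by day.
raw = {
    "apple": {"2017-06-01": 1, "2017-06-03": 16, "2017-06-04": 12,
              "2017-06-05": 8, "2017-06-06": 14, "2017-06-08": 1,
              "2017-06-09": 4},
    "orange": {"2017-06-02": 13, "2017-06-03": 9, "2017-06-04": 9,
               "2017-06-05": 2, "2017-06-06": 11, "2017-06-07": 6,
               "2017-06-08": 3, "2017-06-09": 3, "2017-06-10": 13},
    "grape": {"2017-06-02": 14, "2017-06-03": 16, "2017-06-07": 4,
              "2017-06-09": 15, "2017-06-10": 5},
}
df = pd.DataFrame(
    [(day, fruit, amount)
     for fruit, days in raw.items()
     for day, amount in days.items()],
    columns=["date", "fruit", "amount"],
)
df["date"] = pd.to_datetime(df["date"])

all_days = pd.date_range("2017-06-01", "2017-06-10")


def rolling_sum(amounts):
    """3-row rolling sum over a gap-free daily index (missing days count as 0)."""
    return amounts.reindex(all_days, fill_value=0).rolling(3).sum()


# One rolling series per fruit, stacked into a (fruit, date) index.
temp = (
    pd.concat({fruit: rolling_sum(group.set_index("date")["amount"])
               for fruit, group in df.groupby("fruit")})
      .rename_axis(["fruit", "date"])
      .rename("sum_of_3_days")
      .reset_index()
)

# Keep only the (fruit, date) pairs that exist in the original data.
result = df.merge(temp, on=["fruit", "date"], how="left")
print(result)
```

Naming the index levels explicitly with `rename_axis` avoids relying on the automatically generated `level_1` column name.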