How to apply rolling functions in a group by object in pandas


Problem description

I'm having difficulty solving a look-back or rolling-window problem in a data frame, or perhaps in a groupby.

The following is a simple example of the dataframe I have:

              fruit    amount    
   20140101   apple     3
   20140102   apple     5
   20140102   orange    10
   20140104   banana    2
   20140104   apple     10
   20140104   orange    4
   20140105   orange    6
   20140105   grape     1
   …
   20141231   apple     3
   20141231   grape     2

For every day, I need to calculate the average 'amount' of each fruit over the previous 3 days, and create the following data frame:

              fruit     average_in_last 3 days
   20140104   apple      4
   20140104   orange     10
   ...

For example, on 20140104 the previous 3 days are 20140101, 20140102 and 20140103 (note that the dates in the data frame are not continuous and 20140103 does not exist), so the average amount for apple is (3+5)/2 = 4, for orange it is 10/1 = 10, and for the rest it is 0.

The sample data frame is very simple, but the actual data frame is much larger and more complicated. I hope someone can shed some light on this. Thank you in advance!
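The arithmetic described for 20140104 can be checked by hand with a short sketch. It uses only the rows visible in the sample (the elided rows are left out), and the names `target`, `window` and `expected` are illustrative, not part of the original question.

```python
import pandas as pd

# Sample rows from the question (the elided rows are left out).
df = pd.DataFrame(
    {
        "fruit": ["apple", "apple", "orange", "banana",
                  "apple", "orange", "orange", "grape"],
        "amount": [3, 5, 10, 2, 10, 4, 6, 1],
    },
    index=pd.to_datetime(
        ["20140101", "20140102", "20140102", "20140104",
         "20140104", "20140104", "20140105", "20140105"],
        format="%Y%m%d",
    ),
)

target = pd.Timestamp("2014-01-04")

# Previous 3 calendar days: 2014-01-01 .. 2014-01-03 (the target day is excluded).
window = df[(df.index >= target - pd.Timedelta(days=3)) & (df.index < target)]

# Mean per fruit inside the window; fruits seen on the target day
# but absent from the window get 0, as the question describes.
fruits_today = df.loc[df.index == target, "fruit"].unique()
expected = (
    window.groupby("fruit")["amount"]
    .mean()
    .reindex(fruits_today, fill_value=0)
)
print(expected)
```

This only evaluates a single day; it is meant as a reference value to compare a vectorised solution against, not as a solution for the full data frame.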

Solution

Assume we start with a data frame like this:

>>> df
             fruit  amount
2017-06-01   apple       1
2017-06-03   apple      16
2017-06-04   apple      12
2017-06-05   apple       8
2017-06-06   apple      14
2017-06-08   apple       1
2017-06-09   apple       4
2017-06-02  orange      13
2017-06-03  orange       9
2017-06-04  orange       9
2017-06-05  orange       2
2017-06-06  orange      11
2017-06-07  orange       6
2017-06-08  orange       3
2017-06-09  orange       3
2017-06-10  orange      13
2017-06-02   grape      14
2017-06-03   grape      16
2017-06-07   grape       4
2017-06-09   grape      15
2017-06-10   grape       5
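To reproduce the frame shown above, one possible construction is sketched here. It assumes the index holds `datetime.date` objects (the answer later reindexes with `datetime.date` values, which suggests this, but the original does not show how `df` was built). The helper names `data` and `rows` are illustrative.

```python
import datetime as dt

import pandas as pd

# (day of June 2017, amount) pairs per fruit, copied from the table above.
data = {
    "apple": [(1, 1), (3, 16), (4, 12), (5, 8), (6, 14), (8, 1), (9, 4)],
    "orange": [(2, 13), (3, 9), (4, 9), (5, 2), (6, 11),
               (7, 6), (8, 3), (9, 3), (10, 13)],
    "grape": [(2, 14), (3, 16), (7, 4), (9, 15), (10, 5)],
}

rows = [
    (dt.date(2017, 6, day), fruit, amount)
    for fruit, pairs in data.items()
    for day, amount in pairs
]

df = pd.DataFrame(
    {"fruit": [r[1] for r in rows], "amount": [r[2] for r in rows]},
    index=[r[0] for r in rows],  # assumption: datetime.date objects, not Timestamps
)
print(df.head())
```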

>>> import pandas as pd

>>> dates = [i.date() for i in pd.date_range('2017-06-01', '2017-06-10')]

>>> temp = (df.groupby('fruit')['amount']
    .apply(lambda x: x.reindex(dates)  # fill in the missing dates for each group
                      .fillna(0)       # treat each missing date as an amount of 0
                      .rolling(3)
                      .sum())          # rolling sum over 3 consecutive days
    .reset_index()
    .rename(columns={'amount': 'sum_of_3_days',
                     'level_1': 'date'}))  # rename the date index level to a 'date' column


>>> temp.head()
   fruit        date  sum_of_3_days
0  apple  2017-06-01            NaN
1  apple  2017-06-02            NaN
2  apple  2017-06-03           17.0
3  apple  2017-06-04           28.0
4  apple  2017-06-05           36.0
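The NaN values in the first two rows appear because `rolling(3)` only produces a result once 3 observations are available. If partial windows are acceptable, `min_periods=1` is one option; the small sketch below uses the first four apple values (with the missing 2017-06-02 filled as 0) to show the difference. The names `strict` and `lenient` are illustrative.

```python
import pandas as pd

# Apple amounts for 2017-06-01 .. 2017-06-04, with the missing day filled as 0.
s = pd.Series([1.0, 0.0, 16.0, 12.0])

strict = s.rolling(3).sum()                  # NaN until 3 observations exist
lenient = s.rolling(3, min_periods=1).sum()  # sums whatever is available so far

print(strict.tolist())   # [nan, nan, 17.0, 28.0]
print(lenient.tolist())  # [1.0, 1.0, 17.0, 28.0]
```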

# convert the date index into a date column
>>> df = df.reset_index().rename(columns={'index': 'date'})
# merge returns a new frame, so assign the result back
>>> df = df.merge(temp, on=['fruit', 'date'])
>>> df
          date   fruit  amount  sum_of_3_days
0   2017-06-01   apple       1                NaN
1   2017-06-03   apple      16               17.0
2   2017-06-04   apple      12               28.0
3   2017-06-05   apple       8               36.0
4   2017-06-06   apple      14               34.0
5   2017-06-08   apple       1               15.0
6   2017-06-09   apple       4                5.0
7   2017-06-02  orange      13                NaN
8   2017-06-03  orange       9               22.0
9   2017-06-04  orange       9               31.0
10  2017-06-05  orange       2               20.0
11  2017-06-06  orange      11               22.0
12  2017-06-07  orange       6               19.0
13  2017-06-08  orange       3               20.0
14  2017-06-09  orange       3               12.0
15  2017-06-10  orange      13               19.0
16  2017-06-02   grape      14                NaN
17  2017-06-03   grape      16               30.0
18  2017-06-07   grape       4                4.0
19  2017-06-09   grape      15               19.0
20  2017-06-10   grape       5               20.0
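Note that the result above is a 3-day rolling sum that includes the current day, whereas the question asks for the mean over the previous 3 days, excluding the current day. One way to get that, sketched here as an alternative rather than as the answer's own method, is a time-based window with `closed='left'`. The sketch assumes the dates are stored as pandas timestamps in a `date` column and that each fruit appears at most once per day; `prev_mean` is an illustrative name.

```python
import pandas as pd

# Visible sample rows from the question (elided rows are left out).
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2014-01-01", "2014-01-02", "2014-01-02", "2014-01-04",
             "2014-01-04", "2014-01-04", "2014-01-05", "2014-01-05"]
        ),
        "fruit": ["apple", "apple", "orange", "banana",
                  "apple", "orange", "orange", "grape"],
        "amount": [3, 5, 10, 2, 10, 4, 6, 1],
    }
)

prev_mean = (
    df.sort_values(["fruit", "date"])   # each group must be sorted by date
      .set_index("date")
      .groupby("fruit")["amount"]
      .rolling("3D", closed="left")     # window [t - 3 days, t): excludes day t
      .mean()
      .fillna(0)                        # no rows in the window -> 0, as in the question
      .rename("average_in_last_3_days")
      .reset_index()
)
print(prev_mean)
```

Because the window is defined in calendar days, missing dates need no reindexing or zero-filling. Days without data are simply absent from the average, which matches the (3+5)/2 = 4 example in the question.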
