How to apply rolling functions in a group by object in pandas


Problem description


I'm having difficulty solving a look-back (rolling window) problem on a dataframe, or perhaps with groupby.

The following is a simple example of the dataframe I have:

              fruit    amount    
   20140101   apple     3
   20140102   apple     5
   20140102   orange    10
   20140104   banana    2
   20140104   apple     10
   20140104   orange    4
   20140105   orange    6
   20140105   grape     1
   …
   20141231   apple     3
   20141231   grape     2

I need to calculate, for every day, the average 'amount' of each fruit over the previous 3 days, and create the following data frame:

              fruit     average_in_last 3 days
   20140104   apple      4
   20140104   orange     10
   ...

For example, on 20140104 the previous 3 days are 20140101, 20140102 and 20140103 (note that the dates in the data frame are not continuous and 20140103 does not exist). The average amount of apple is (3+5)/2 = 4, orange is 10/1 = 10, and the rest are 0.

The sample data frame is very simple but the actual data frame is much more complicated and larger. Hope someone can shed some light on this, thank you in advance!

Solution

Assuming we start with a data frame like this, indexed by date:

>>> df
             fruit  amount
2017-06-01   apple       1
2017-06-03   apple      16
2017-06-04   apple      12
2017-06-05   apple       8
2017-06-06   apple      14
2017-06-08   apple       1
2017-06-09   apple       4
2017-06-02  orange      13
2017-06-03  orange       9
2017-06-04  orange       9
2017-06-05  orange       2
2017-06-06  orange      11
2017-06-07  orange       6
2017-06-08  orange       3
2017-06-09  orange       3
2017-06-10  orange      13
2017-06-02   grape      14
2017-06-03   grape      16
2017-06-07   grape       4
2017-06-09   grape      15
2017-06-10   grape       5
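
For anyone who wants to reproduce the steps, the frame shown above could be built roughly as follows. This is only a sketch: it assumes the index holds plain `datetime.date` objects with no index name, which is what the later `i.date()` and `'index'` rename steps suggest; the `rows` list is just a convenient way to type in the data.

```python
import pandas as pd

# (date, fruit, amount) triples copied from the frame printed above.
rows = [
    ("2017-06-01", "apple", 1), ("2017-06-03", "apple", 16),
    ("2017-06-04", "apple", 12), ("2017-06-05", "apple", 8),
    ("2017-06-06", "apple", 14), ("2017-06-08", "apple", 1),
    ("2017-06-09", "apple", 4),
    ("2017-06-02", "orange", 13), ("2017-06-03", "orange", 9),
    ("2017-06-04", "orange", 9), ("2017-06-05", "orange", 2),
    ("2017-06-06", "orange", 11), ("2017-06-07", "orange", 6),
    ("2017-06-08", "orange", 3), ("2017-06-09", "orange", 3),
    ("2017-06-10", "orange", 13),
    ("2017-06-02", "grape", 14), ("2017-06-03", "grape", 16),
    ("2017-06-07", "grape", 4), ("2017-06-09", "grape", 15),
    ("2017-06-10", "grape", 5),
]

df = pd.DataFrame(rows, columns=["date", "fruit", "amount"])
# Assumption: the index stores datetime.date objects, not Timestamps.
df["date"] = pd.to_datetime(df["date"]).dt.date
# Drop the index name so that reset_index() later yields an 'index' column.
df = df.set_index("date").rename_axis(None)
```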

>>> import pandas as pd
>>> # every calendar day in the period, as datetime.date objects
>>> dates = [i.date() for i in pd.date_range('2017-06-01', '2017-06-10')]

>>> temp = (df.groupby('fruit')['amount']
    .apply(lambda x: x.reindex(dates)  # add a row for every missing date in the group
                      .fillna(0)       # treat the missing dates as an amount of 0
                      .rolling(3)      # window of 3 consecutive calendar days
                      .sum())          # rolling sum over that window
    .reset_index()
    .rename(columns={'amount': 'sum_of_3_days',
                     'level_1': 'date'}))  # name the value column and the date level


>>> temp.head()
   fruit        date  sum_of_3_days
0  apple  2017-06-01            NaN
1  apple  2017-06-02            NaN
2  apple  2017-06-03           17.0
3  apple  2017-06-04           28.0
4  apple  2017-06-05           36.0

# convert the date index into a date column
>>> df = df.reset_index().rename(columns={'index': 'date'})
# merge returns a new frame, so assign the result back
>>> df = df.merge(temp, on=['fruit', 'date'])
>>> df
          date   fruit  amount  sum_of_3_days
0   2017-06-01   apple       1                NaN
1   2017-06-03   apple      16               17.0
2   2017-06-04   apple      12               28.0
3   2017-06-05   apple       8               36.0
4   2017-06-06   apple      14               34.0
5   2017-06-08   apple       1               15.0
6   2017-06-09   apple       4                5.0
7   2017-06-02  orange      13                NaN
8   2017-06-03  orange       9               22.0
9   2017-06-04  orange       9               31.0
10  2017-06-05  orange       2               20.0
11  2017-06-06  orange      11               22.0
12  2017-06-07  orange       6               19.0
13  2017-06-08  orange       3               20.0
14  2017-06-09  orange       3               12.0
15  2017-06-10  orange      13               19.0
16  2017-06-02   grape      14                NaN
17  2017-06-03   grape      16               30.0
18  2017-06-07   grape       4                4.0
19  2017-06-09   grape      15               19.0
20  2017-06-10   grape       5               20.0
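
Note that the code above produces a rolling sum over a window that includes the current day, whereas the question asks for the mean over the previous 3 days, excluding the current day and counting only days that have data. One possible way to get closer to that, sketched here on the question's sample rows, is a time-based window with `closed='left'`. It assumes the dates are parsed into a `DatetimeIndex` and sorted; the column name `average_in_last_3_days` is chosen here for illustration.

```python
import pandas as pd

# Sample rows from the question (the elided rows are left out).
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2014-01-01", "2014-01-02", "2014-01-02", "2014-01-04",
             "2014-01-04", "2014-01-04", "2014-01-05", "2014-01-05"]
        ),
        "fruit": ["apple", "apple", "orange", "banana",
                  "apple", "orange", "orange", "grape"],
        "amount": [3, 5, 10, 2, 10, 4, 6, 1],
    }
).set_index("date").sort_index()  # offset windows need a sorted datetime index

prev_3d_mean = (
    df.groupby("fruit")["amount"]
      .rolling("3D", closed="left")   # window [t - 3 days, t): current day excluded
      .mean()                         # averages only the rows that actually exist
      .fillna(0)                      # no earlier data in the window counts as 0
      .rename("average_in_last_3_days")
      .reset_index()
)
```

For 2014-01-04 this should give 4 for apple and 10 for orange, matching the figures in the question, because missing days are simply absent from the window rather than filled with zeros.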
