pandas GroupBy and cumulative mean of previous rows in group

Question

I have a dataframe which looks like this:

pd.DataFrame({'category': [1,1,1,2,2,2,3,3,3,4],
              'order_start': [1,2,3,1,2,3,1,2,3,1],
              'time': [1, 4, 3, 6, 8, 17, 14, 12, 13, 16]})
Out[40]: 
   category  order_start  time
0         1            1     1
1         1            2     4
2         1            3     3
3         2            1     6
4         2            2     8
5         2            3    17
6         3            1    14
7         3            2    12
8         3            3    13
9         4            1    16

I would like to create a new column containing the mean of the previous times within the same category. How can I create it?

The new column should look like this:

pd.DataFrame({'category': [1,1,1,2,2,2,3,3,3,4],
              'order_start': [1,2,3,1,2,3,1,2,3,1],
              'time': [1, 4, 3, 6, 8, 17, 14, 12, 13, 16],
              'mean': [np.nan, 1, 2.5, np.nan, 6, 7, np.nan, 14, 13, np.nan]})
Out[41]: 
   category  order_start  time  mean
0         1            1     1   NaN
1         1            2     4   1.0    = 1 / 1
2         1            3     3   2.5    = (4+1)/2
3         2            1     6   NaN
4         2            2     8   6.0    = 6 / 1
5         2            3    17   7.0    = (8+6) / 2
6         3            1    14   NaN
7         3            2    12  14.0
8         3            3    13  13.0
9         4            1    16   NaN

Note: If it is the first time, the mean should be NaN.

As cs95 pointed out, my question is not quite the same as this one, since here an expanding window is required.

Recommended answer

"create a new column which contains the mean of the previous times of the same category" sounds like a good use case for GroupBy.expanding (and a shift):

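df['mean'] = (
    # group_keys=False keeps the result on df's original index; on pandas >= 2.0
    # apply would otherwise prepend the group key and the assignment would not align
    df.groupby('category', group_keys=False)['time']
      .apply(lambda x: x.shift().expanding().mean()))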
df
   category  order_start  time  mean
0         1            1     1   NaN
1         1            2     4   1.0
2         1            3     3   2.5
3         2            1     6   NaN
4         2            2     8   6.0
5         2            3    17   7.0
6         3            1    14   NaN
7         3            2    12  14.0
8         3            3    13  13.0
9         4            1    16   NaN
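
To see why the shift is needed, it may help to walk through a single group by hand. The following is a minimal sketch using only the three `time` values of category 1 from the question; the variable names `s`, `shifted` and `prev_mean` are illustrative and not part of the original answer.

```python
import pandas as pd

# Times for category 1 from the question.
s = pd.Series([1, 4, 3])

# shift() moves every value down one row, so each row only "sees" earlier rows.
shifted = s.shift()                      # NaN, 1.0, 4.0

# expanding().mean() averages everything seen so far, skipping NaN.
prev_mean = shifted.expanding().mean()   # NaN, 1.0, 2.5

print(pd.DataFrame({'time': s, 'shifted': shifted, 'prev_mean': prev_mean}))
```

Without the shift, the expanding mean would include the current row's own time, and the first row of each group would get its own value instead of NaN.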


Another way to calculate this is without the apply (chaining two groupby calls):

df['mean'] = (
    df.groupby('category')['time']
      .shift()
      .groupby(df['category'])
      .expanding()
      .mean()
      .to_numpy())  # replace to_numpy() with `.values` for pd.__version__ < 0.24
df
   category  order_start  time  mean
0         1            1     1   NaN
1         1            2     4   1.0
2         1            3     3   2.5
3         2            1     6   NaN
4         2            2     8   6.0
5         2            3    17   7.0
6         3            1    14   NaN
7         3            2    12  14.0
8         3            3    13  13.0
9         4            1    16   NaN
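
One caveat with the chained version: `expanding()` on a groupby returns its result ordered group by group, and `to_numpy()` discards the index, so the positional assignment only lines up when the rows of each category are already contiguous and the categories appear in sorted order, as they do in the question. If that cannot be guaranteed, a possible variant (a sketch, not part of the original answer) drops the group level from the index and lets pandas align on the row labels instead. It assumes the frame's index is unique; the frame `mixed` below is a hypothetical reordering of the question's data used only to demonstrate this.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'category': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4],
                   'order_start': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
                   'time': [1, 4, 3, 6, 8, 17, 14, 12, 13, 16]})

# Interleave the categories so the rows of a group are no longer contiguous.
mixed = df.sort_values(['order_start', 'category'])

prev_mean = (
    mixed.groupby('category')['time']
         .shift()
         .groupby(mixed['category'])
         .expanding()
         .mean()
         .reset_index(level=0, drop=True))  # drop the group level, keep the row labels

mixed['mean'] = prev_mean      # Series assignment aligns on the index
result = mixed.sort_index()    # back to the original row order
print(result)
```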

In terms of performance, it really depends on the number and size of your groups.
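
A rough way to check this on your own data is to time both versions with `timeit`. The sketch below uses synthetic frames; the helper names (`make_frame`, `with_apply`, `with_chained_groupby`) and the group counts and sizes are made up for illustration, and the timings will vary with the pandas version and hardware, so no results are claimed here.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n_groups, group_size, seed=0):
    """Synthetic data: contiguous, sorted categories with random times."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        'category': np.repeat(np.arange(n_groups), group_size),
        'time': rng.integers(1, 100, size=n_groups * group_size),
    })


def with_apply(df):
    # One Python-level lambda call per group.
    return (df.groupby('category', group_keys=False)['time']
              .apply(lambda x: x.shift().expanding().mean()))


def with_chained_groupby(df):
    # Two groupby passes, no Python-level function per group.
    return (df.groupby('category')['time'].shift()
              .groupby(df['category']).expanding().mean()
              .to_numpy())


# Many small groups versus few large groups.
for n_groups, group_size in [(500, 20), (20, 500)]:
    frame = make_frame(n_groups, group_size)
    t_apply = timeit.timeit(lambda: with_apply(frame), number=3)
    t_chained = timeit.timeit(lambda: with_chained_groupby(frame), number=3)
    print(f'{n_groups} groups x {group_size} rows: '
          f'apply {t_apply:.3f}s, chained {t_chained:.3f}s')
```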
