pandas GroupBy and cumulative mean of previous rows in group
Question
I have a dataframe that looks like this:
pd.DataFrame({'category': [1,1,1,2,2,2,3,3,3,4],
'order_start': [1,2,3,1,2,3,1,2,3,1],
'time': [1, 4, 3, 6, 8, 17, 14, 12, 13, 16]})
Out[40]:
category order_start time
0 1 1 1
1 1 2 4
2 1 3 3
3 2 1 6
4 2 2 8
5 2 3 17
6 3 1 14
7 3 2 12
8 3 3 13
9 4 1 16
I would like to create a new column that contains the mean of the previous times within the same category. How can I create it?
The new column should look like this:
pd.DataFrame({'category': [1,1,1,2,2,2,3,3,3,4],
'order_start': [1,2,3,1,2,3,1,2,3,1],
'time': [1, 4, 3, 6, 8, 17, 14, 12, 13, 16],
'mean': [np.nan, 1, 2.5, np.nan, 6, 7, np.nan, 14, 13, np.nan]})
Out[41]:
category order_start time mean
0 1 1 1 NaN
1 1 2 4 1.0 = 1 / 1
2 1 3 3 2.5 = (4+1)/2
3 2 1 6 NaN
4 2 2 8 6.0 = 6 / 1
5 2 3 17 7.0 = (8+6) / 2
6 3 1 14 NaN
7 3 2 12 14.0 = 14 / 1
8 3 3 13 13.0 = (12+14) / 2
9 4 1 16 NaN
Note: For the first row of each category there are no previous times, so the mean should be NaN.
As cs95 pointed out, my question is not quite the same as this one, since expanding is required here.
Recommended answer
"Create a new column which contains the mean of the previous times of the same category" sounds like a good use case for GroupBy.expanding (and a shift):
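df['mean'] = (
    # group_keys=False keeps the original row index, which pandas >= 2.0 needs for the assignment to align
    df.groupby('category', group_keys=False)['time']
      .apply(lambda x: x.shift().expanding().mean()))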
df
category order_start time mean
0 1 1 1 NaN
1 1 2 4 1.0
2 1 3 3 2.5
3 2 1 6 NaN
4 2 2 8 6.0
5 2 3 17 7.0
6 3 1 14 NaN
7 3 2 12 14.0
8 3 3 13 13.0
9 4 1 16 NaN
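For readers who want to reproduce the result end to end, here is a minimal self-contained sketch of the same shift-then-expanding idea. It uses `transform` instead of `apply`, which is a substitution on my part rather than the answer's exact code, and the helper name `previous_mean` is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'category': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4],
                   'order_start': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
                   'time': [1, 4, 3, 6, 8, 17, 14, 12, 13, 16]})

def previous_mean(s):
    # shift() moves each value down one row, so the current row is excluded;
    # expanding().mean() then averages everything seen so far in the group.
    return s.shift().expanding().mean()

# transform returns a Series aligned to df's index, so it can be assigned directly.
df['mean'] = df.groupby('category')['time'].transform(previous_mean)
print(df)
```

The first row of each category ends up as NaN because, after the shift, there are no earlier values to average.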
Another way to calculate this is without the apply (chaining two groupby calls):
df['mean'] = (
df.groupby('category')['time']
.shift()
.groupby(df['category'])
.expanding()
.mean()
.to_numpy()) # replace to_numpy() with `.values` for pd.__version__ < 0.24
df
category order_start time mean
0 1 1 1 NaN
1 1 2 4 1.0
2 1 3 3 2.5
3 2 1 6 NaN
4 2 2 8 6.0
5 2 3 17 7.0
6 3 1 14 NaN
7 3 2 12 14.0
8 3 3 13 13.0
9 4 1 16 NaN
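One caveat with the chained version: the result of `expanding()` comes back ordered by group, so assigning it positionally with `to_numpy()` matches the rows only when the frame is already sorted by category, as it is in the question. The sketch below is a possible index-aligned variant, assuming a small hypothetical frame whose categories are interleaved. It drops the group level with `droplevel(0)` so that the assignment aligns on the original row index.

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows of categories 1 and 2 are interleaved on purpose.
df = pd.DataFrame({'category': [2, 1, 2, 1, 3],
                   'time': [6, 1, 8, 4, 14]})

# Step 1: shift within each category (result keeps the original index).
shifted = df.groupby('category')['time'].shift()

# Step 2: expanding mean within each category; the result has a
# (category, original index) MultiIndex, ordered by category.
df['mean'] = (
    shifted.groupby(df['category'])
           .expanding()
           .mean()
           .droplevel(0))  # keep only the original row index so assignment aligns

print(df)
```

With this frame, positional assignment would place the values on the wrong rows, whereas the index-aligned assignment puts 6.0 on the second category-2 row and 1.0 on the second category-1 row.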
In terms of performance, which approach is faster really depends on the number and size of your groups.
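A rough way to check this on your own data is to time both approaches with `timeit`. The sketch below uses synthetic data; the group count and group size are arbitrary assumptions, the function names are hypothetical, and the lambda-based variant uses `transform` rather than `apply`. No timings are quoted here because they vary with pandas version and hardware.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_groups, group_size = 200, 50  # arbitrary sizes; vary these to see the trade-off
df = pd.DataFrame({
    'category': np.repeat(np.arange(n_groups), group_size),
    'time': rng.integers(1, 100, size=n_groups * group_size),
})

def with_lambda(frame):
    # One Python-level call per group.
    return frame.groupby('category')['time'].transform(
        lambda s: s.shift().expanding().mean())

def with_two_groupbys(frame):
    # Two groupby passes, no per-group Python function.
    return (frame.groupby('category')['time'].shift()
                 .groupby(frame['category']).expanding().mean()
                 .droplevel(0))

t_lambda = timeit.timeit(lambda: with_lambda(df), number=3)
t_chained = timeit.timeit(lambda: with_two_groupbys(df), number=3)
print(f"lambda per group: {t_lambda:.3f}s, chained groupby: {t_chained:.3f}s")
```

Changing `n_groups` and `group_size` while keeping the total row count fixed is a simple way to see how the balance shifts between many small groups and a few large ones.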