Speeding up rolling sum calculation in pandas groupby
Question
I want to compute rolling sums group-wise for a large number of groups and I'm having trouble doing it acceptably quickly.
Pandas has built-in methods for rolling and expanding calculations.
Here is an example:
import pandas as pd
import numpy as np
obs_per_g = 20
g = 10000
obs = g * obs_per_g
k = 20
df = pd.DataFrame(
data=np.random.normal(size=obs * k).reshape(obs, k),
index=pd.MultiIndex.from_product(iterables=[range(g), range(obs_per_g)]),
)
To get rolling and expanding sums I can use
df.groupby(level=0).expanding().sum()
df.groupby(level=0).rolling(window=5).sum()
But this takes a long time for a very large number of groups. For expanding sums, using the pandas method cumsum instead is almost 60 times quicker (16s vs 280ms for the above example) and turns hours into minutes.
df.groupby(level=0).cumsum()
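As a quick sanity check (a small comparison, not part of the original post), the two approaches can be confirmed to produce identical values on a reduced version of the setup above:

```python
import numpy as np
import pandas as pd

# Small version of the setup above, just for the comparison
df = pd.DataFrame(
    data=np.random.normal(size=(40, 3)),
    index=pd.MultiIndex.from_product([range(4), range(10)]),
)

expanding = df.groupby(level=0).expanding().sum()
cumulative = df.groupby(level=0).cumsum()

# expanding() adds the group key as an extra outer index level,
# so drop it before comparing the values
assert np.allclose(expanding.droplevel(0).values, cumulative.values)
```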
Is there a fast implementation of rolling sum in pandas, like cumsum is for expanding sums? If not, could I use numpy to accomplish this?
Answer
I had the same experience with .rolling(): it's nice, but only for small datasets or when the function you are applying is non-standard. For sum(), I would suggest using cumsum() and subtracting cumsum().shift(5):
csum = df.groupby(level=0).cumsum()
csum - csum.groupby(level=0).shift(5)
Note that the shift must also be applied per group; a plain .shift(5) on the whole frame would subtract values from the end of the previous group in the first rows of each group.
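To see that this trick reproduces rolling(window=5).sum() exactly, here is a small verification sketch (my own addition, with arbitrary sizes). The only edge case is the first complete window of each group, at position window-1, where rolling() already has a value but the shifted cumsum is still NaN; there the amount to subtract is simply 0:

```python
import numpy as np
import pandas as pd

# Small frame just to check the trick; sizes here are arbitrary
obs_per_g, g, k, window = 10, 4, 3, 5
df = pd.DataFrame(
    data=np.random.normal(size=(g * obs_per_g, k)),
    index=pd.MultiIndex.from_product([range(g), range(obs_per_g)]),
)

# Slow reference: groupby + rolling (drop the extra group level it adds)
reference = df.groupby(level=0).rolling(window=window).sum().droplevel(0)

# Fast version: difference of grouped cumulative sums. The shift must
# also be grouped, otherwise values leak across group boundaries.
csum = df.groupby(level=0).cumsum()
shifted = csum.groupby(level=0).shift(window)

# At the first complete window (position window-1 within each group)
# there is nothing before the window, so subtract 0 instead of NaN
pos = df.groupby(level=0).cumcount()
shifted.loc[pos == window - 1] = 0.0

fast = csum - shifted
assert np.allclose(reference.values, fast.values, equal_nan=True)
```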