Speeding up rolling sum calculation in pandas groupby

Problem description

I want to compute rolling sums group-wise for a large number of groups and I'm having trouble doing it acceptably quickly.

Pandas has built-in methods for rolling and expanding calculations.

Here is an example:

import pandas as pd
import numpy as np
obs_per_g = 20
g = 10000
obs = g * obs_per_g
k = 20
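# 200,000 rows in total: 10,000 groups of 20 observations each, 20 columns,
# MultiIndexed by (group, observation)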
df = pd.DataFrame(
    data=np.random.normal(size=obs * k).reshape(obs, k),
    index=pd.MultiIndex.from_product(iterables=[range(g), range(obs_per_g)]),
)

To get rolling and expanding sums I can use:

df.groupby(level=0).expanding().sum()
df.groupby(level=0).rolling(window=5).sum()

But this takes a long time for a very large number of groups. For expanding sums, using the pandas method cumsum instead is almost 60 times faster (16s vs 280ms for the example above) and turns hours into minutes:

df.groupby(level=0).cumsum()
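The gap can be checked with a quick timing run. Below is a minimal sketch, assuming the df built above; the absolute numbers will vary with hardware and pandas version (the 16s vs 280ms figures come from the original example):

import time

# time the built-in group-wise expanding sum
start = time.perf_counter()
_ = df.groupby(level=0).expanding().sum()
print(f"expanding().sum(): {time.perf_counter() - start:.2f} s")

# time the cumsum-based equivalent
start = time.perf_counter()
_ = df.groupby(level=0).cumsum()
print(f"cumsum(): {time.perf_counter() - start:.2f} s")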

Is there a fast implementation of rolling sum in pandas, like cumsum is for expanding sums? If not, could I use numpy to accomplish this?

Recommended answer

I had the same experience with .rolling(): it's nice, but only on small datasets or when the function you are applying is non-standard. For sum() I would suggest using cumsum() and subtracting a shifted cumsum(); note that the shift itself also has to be done group-wise, otherwise values leak across group boundaries:

gsum = df.groupby(level=0).cumsum()
rolling_sum = gsum - gsum.groupby(level=0).shift(5)
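One way to sanity-check the cumsum-based approach is to compare it against the built-in rolling result. The following is a minimal sketch, assuming the df and gsum defined above and a window of 5: it fills the shifted cumsum with zero so the first complete window of each group is kept, masks the rows where the window has not yet filled, and checks that the values agree up to floating-point rounding:

import numpy as np

window = 5

# shift within each group, then treat missing values as 0 so the first
# complete window of each group is kept
shifted = gsum.groupby(level=0).shift(window).fillna(0)
fast = gsum - shifted

# rows where the window has not filled yet should be NaN, matching
# rolling(window=5).sum() with its default min_periods
pos = df.groupby(level=0).cumcount()
fast.loc[pos < window - 1] = np.nan

# built-in (slow) version for comparison; it can return a different index
# layout than the original frame, so compare the raw values
slow = df.groupby(level=0).rolling(window=window).sum()
assert np.allclose(fast.to_numpy(), slow.to_numpy(), equal_nan=True)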
