Quick pandas groupby calculations with cumprod

Question

This question is linked to "Speedup of pandas groupby". It is about speeding up a groupby cumulative-product (cumprod) calculation. The DataFrame is 2D and has a MultiIndex consisting of 3 integer levels.

The HDF5 file for the DataFrame can be found here: http://filebin.ca/2Csy0E2QuF2w/phi.h5

The actual calculation that I'm performing is similar to this:

   >>> phi = pd.read_hdf('phi.h5', 'phi')
   >>> %timeit phi.groupby(level='atomic_number').cumprod()
   100 loops, best of 3: 5.45 ms per loop

Another possible speedup: I run this calculation about 100 times with the same index structure but different numbers, so I wonder whether the index (or the grouping derived from it) can somehow be cached.

Any help will be appreciated.

Recommended answer

Numba appears to work well here. Going by the timings below, the numba function is about 4,000x faster than the original groupby method and about 5x faster than a plain cumprod without a groupby, which seems almost too good to be true. Hopefully the results are correct; let me know if there is an error.

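import numpy as np
import pandas as pd
from numba import jit

np.random.seed(1234)
df = pd.DataFrame({'x': np.repeat(range(200), 4), 'y': np.random.randn(800)})
# Equal keys must be contiguous for the loop below, so sort by the group key
# (DataFrame.sort was removed from pandas; sort_values is its replacement)
df = df.sort_values('x')
df['cp_groupby'] = df.groupby('x')['y'].cumprod()

@jit(nopython=True)
def group_cumprod(x, y):
    # Running product of y that restarts whenever the key in x changes
    z = np.ones(len(x))
    for i in range(len(x)):
        # i > 0 keeps the first row from being compared with x[-1]
        if i > 0 and x[i] == x[i-1]:
            z[i] = y[i] * z[i-1]
        else:
            z[i] = y[i]
    return z

df['cp_numba'] = group_cumprod(df.x.values, df.y.values)

df['dif'] = df.cp_groupby - df.cp_numba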

Test that both ways give the same answer:

all(df.cp_groupby==df.cp_numba)
Out[1447]: True
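
The check above relies on exact floating-point equality, which happens to hold here because both methods multiply the values in the same order. A more tolerant comparison could look like the following sketch. It uses a plain-Python reference loop (the hypothetical `group_cumprod_ref`, not part of the original answer) so that it runs without numba, and it assumes the rows are already sorted so that equal keys are contiguous.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1234)
df = pd.DataFrame({"x": np.repeat(np.arange(200), 4),
                   "y": rng.standard_normal(800)})

# Reference result from pandas
expected = df.groupby("x")["y"].cumprod().to_numpy()

def group_cumprod_ref(keys, values):
    """Hypothetical reference loop: running product that restarts
    whenever the key changes (keys assumed sorted/contiguous)."""
    out = np.empty(len(values), dtype=np.float64)
    for i in range(len(values)):
        if i > 0 and keys[i] == keys[i - 1]:
            out[i] = values[i] * out[i - 1]
        else:
            out[i] = values[i]
    return out

result = group_cumprod_ref(df["x"].to_numpy(), df["y"].to_numpy())

# Compare within a floating-point tolerance instead of exact equality
same = np.allclose(expected, result)
print(same)
```

`np.allclose` tolerates small rounding differences, which matters if one implementation ever multiplies in a different order than the other.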

Timings:

%timeit df.groupby('x').cumprod()
10 loops, best of 3: 102 ms per loop

%timeit df['y'].cumprod()
10000 loops, best of 3: 133 µs per loop

%timeit group_cumprod(df.x.values,df.y.values)
10000 loops, best of 3: 24.4 µs per loop
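
For the repeated-calculation case raised in the question (same index, different numbers), one possible approach is to extract the group key from the MultiIndex once and reuse it on every call. The sketch below is an illustration under assumptions, not the answer author's code: `group_cumprod_2d` is a hypothetical 2D variant of the function above, the level names `b` and `c` and the frame sizes are made up, and rows are assumed to be sorted so that equal `atomic_number` values are contiguous. Numba is used if it is installed; otherwise the same function runs as plain Python.

```python
import numpy as np
import pandas as pd

def group_cumprod_2d(keys, values):
    """Cumulative product down each column, restarting whenever
    keys changes (keys assumed sorted/contiguous)."""
    n_rows, n_cols = values.shape
    out = np.empty((n_rows, n_cols), dtype=np.float64)
    for i in range(n_rows):
        new_group = i == 0 or keys[i] != keys[i - 1]
        for j in range(n_cols):
            if new_group:
                out[i, j] = values[i, j]
            else:
                out[i, j] = values[i, j] * out[i - 1, j]
    return out

try:
    # Optional: compile with numba when available
    from numba import njit
    group_cumprod_2d = njit(group_cumprod_2d)
except ImportError:
    pass

# Hypothetical 3-level integer index; only 'atomic_number' comes from the question
index = pd.MultiIndex.from_product(
    [range(1, 4), range(2), range(3)],
    names=["atomic_number", "b", "c"],
)
rng = np.random.default_rng(0)
frames = [pd.DataFrame(rng.standard_normal((len(index), 5)), index=index)
          for _ in range(3)]

# Extract the group key once and reuse it for every frame
keys = index.get_level_values("atomic_number").to_numpy()

results = [
    pd.DataFrame(group_cumprod_2d(keys, f.to_numpy()),
                 index=f.index, columns=f.columns)
    for f in frames
]
```

Because the key array is built once, each further call only pays for the loop itself rather than for regrouping the index.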
