Improving the performance of pandas groupby


Problem description


I have a machine learning application written in Python which includes a data processing step. When I wrote it, I initially did the data processing on Pandas DataFrames, but when this led to abysmal performance I eventually rewrote it in vanilla Python, with for loops instead of vectorized operations and lists and dicts instead of DataFrames and Series. To my surprise, the performance of the code written in vanilla Python ended up being far higher than that of the code written using Pandas.


As my handcoded data processing code is substantially bigger and messier than the original Pandas code, I haven't quite given up on using Pandas, and I'm currently trying to optimize the Pandas code without much success.


The core of the data processing step consists of the following: I first divide the rows into several groups, as the data consists of several thousand time series (one for each "individual"), and I then do the same data processing on each group: a lot of summarization, combining different columns into new ones, etc.


I profiled my code using Jupyter Notebook's lprun, and the bulk of the time is spent on the following and other similar lines:

grouped_data = data.groupby('pk')
data[[v + 'Diff' for v in val_cols]] = grouped_data[val_cols].transform(lambda x: x - x.shift(1)).fillna(0)
data[[v + 'Mean' for v in val_cols]] = grouped_data[val_cols].rolling(4).mean().shift(1).reset_index()[val_cols]
(...)


...a mix of vectorized and non-vectorized processing. I understand that the non-vectorized operations won't be faster than my handwritten for loops, since that's basically what they are under the hood, but how can they be so much slower? We're talking about a performance degradation of 10-20x between my handwritten code and the Pandas code.

Am I doing something wrong?

Answer


No, I don't think you should give up on pandas. There are definitely better ways to do what you're trying to do. The trick is to avoid apply/transform in any form as much as possible. Avoid them like the plague. They are basically implemented as for loops under the hood, so using them buys you little over writing the Python loops yourself.


The real speed gain comes from getting rid of the loops and using pandas functions that implicitly vectorise their operations. For example, your first line of code can be simplified greatly, as I will show shortly.


In this post I outline the setup process, and then, for each line in your question, offer an improvement, along with a side-by-side comparison of the timings and correctness.

import numpy as np
import pandas as pd

data = {'pk': np.random.choice(10, 1000)}
data.update({'Val{}'.format(i): np.random.randn(1000) for i in range(100)})

df = pd.DataFrame(data)

g = df.groupby('pk')
c = ['Val{}'.format(i) for i in range(100)]


transform + sub + shift → diff


Your first line of code can be replaced with a simple diff statement:

v1 = df.groupby('pk')[c].diff().fillna(0)

Sanity check

v2 = df.groupby('pk')[c].transform(lambda x: x - x.shift(1)).fillna(0)

np.allclose(v1, v2)
True

Performance

%timeit df.groupby('pk')[c].transform(lambda x: x - x.shift(1)).fillna(0)
10 loops, best of 3: 44.3 ms per loop

%timeit df.groupby('pk')[c].diff().fillna(0)
100 loops, best of 3: 9.63 ms per loop




Removing redundant indexing operations

As far as your second line of code is concerned, I don't see much room for improvement, although you can get rid of the reset_index() + [val_cols] call if your groupby statement does not set pk as the index:

g = df.groupby('pk', as_index=False)


Your second line of code then reduces to:

v3 = g[c].rolling(4).mean().shift(1)

Sanity check

g2 = df.groupby('pk')
v4 = g2[c].rolling(4).mean().shift(1).reset_index()[c]

np.allclose(v3.fillna(0), v4.fillna(0))
True

Performance

%timeit df.groupby('pk')[c].rolling(4).mean().shift(1).reset_index()[c]
10 loops, best of 3: 46.5 ms per loop

%timeit df.groupby('pk', as_index=False)[c].rolling(4).mean().shift(1)
10 loops, best of 3: 41.7 ms per loop


Note that timings vary across machines, so test your code thoroughly to confirm there is indeed an improvement on your data.
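
If you want to repeat that check on your own data, here is a minimal sketch using the standard-library timeit module (my example, not part of the original answer), reusing the df and c defined in the setup above:

import timeit

# Time the original transform-based version against the diff-based one.
slow = timeit.timeit(
    lambda: df.groupby('pk')[c].transform(lambda x: x - x.shift(1)).fillna(0),
    number=10)
fast = timeit.timeit(
    lambda: df.groupby('pk')[c].diff().fillna(0),
    number=10)
print('transform: {:.3f}s, diff: {:.3f}s over 10 runs'.format(slow, fast))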


While the difference this time isn't as large, the point is that there are still improvements to be made, and they could have a much bigger impact on larger data.


In conclusion, most of these operations are slow only because faster, vectorised ways of expressing them exist. The key is to get rid of any approach that does not use vectorization.
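
As a toy illustration of that principle (my own example, reusing df and c from the setup above), compare a row-wise apply against its vectorised equivalent:

# Python-level loop: the lambda is called once per row.
slow = df[c].apply(lambda row: row.sum(), axis=1)

# Vectorised: a single call that runs over the whole block in C.
fast = df[c].sum(axis=1)

assert np.allclose(slow, fast)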


To this end, it is sometimes beneficial to step out of pandas space and into numpy space. Operations on numpy arrays, or with numpy functions directly, tend to be much faster than their pandas equivalents (for example, np.sum is faster than pd.DataFrame.sum, np.where is faster than pd.DataFrame.where, and so on).
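
As a rough sketch of what stepping into numpy space can look like for the diff computation above (my illustration, not from the original answer; it assumes the rows are first sorted by pk so each group is contiguous):

# Sort once so each group is a contiguous block, then leave pandas.
sorted_df = df.sort_values('pk', kind='stable')
vals = sorted_df[c].to_numpy()                 # shape (1000, 100)
pk = sorted_df['pk'].to_numpy()

# Row-wise diff with a leading row of zeros, reset at group boundaries.
diffs = np.vstack([np.zeros((1, vals.shape[1])), np.diff(vals, axis=0)])
new_group = np.concatenate([[True], pk[1:] != pk[:-1]])
diffs = np.where(new_group[:, None], 0.0, diffs)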


Sometimes, loops cannot be avoided. In that case, you can write a basic looping function and then compile it with numba or cython. Examples are in the pandas Enhancing Performance guide, straight from the horse's mouth.
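
A minimal numba sketch of that idea (my illustration, not taken from the pandas docs), again assuming the rows have been sorted by pk so each group is contiguous:

import numpy as np
from numba import njit

@njit
def group_diff(codes, values):
    # Per-group first difference; the first row of each group gets 0.
    out = np.empty_like(values)
    for i in range(values.shape[0]):
        if i == 0 or codes[i] != codes[i - 1]:
            out[i] = 0.0
        else:
            out[i] = values[i] - values[i - 1]
    return out

# Hypothetical usage, one column at a time, on a frame sorted by pk:
# s = df.sort_values('pk', kind='stable')
# diffs0 = group_diff(s['pk'].to_numpy(), s['Val0'].to_numpy())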


In still other cases, your data is just too big to reasonably fit into numpy arrays. At that point it is time to give up on pandas and switch to dask or spark, both of which offer high-performance, distributed computing frameworks for working with big data.
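
For dask, a minimal sketch of pushing the same groupby onto partitions (my example, assuming dask is installed) might look like:

import dask.dataframe as dd

# Split the pandas frame into partitions; the groupby builds a lazy task graph.
ddf = dd.from_pandas(df, npartitions=8)
group_means = ddf.groupby('pk')[c].mean()
result = group_means.compute()   # executes the graph and returns a pandas object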

