Pandas vectorised function cumsum versus numpy


Question

While answering the question Vectorize calculation of a Pandas Dataframe, I noticed an interesting issue regarding performance.

I was under the impression that functions such as df.min(), df.mean(), df.cumsum(), etc, are vectorised. However, I am seeing a massive discrepancy in performance between df.cumsum() and a numpy alternative.

Given pandas uses numpy arrays in its infrastructure, I expected performance to be closer. I tried investigating the source code for df.cumsum() but found it intractable. Can someone explain why it is so much slower?

As seen in the answer by @HYRY, the issue reduces to the question of why the following two commands give such a huge discrepancy in timings:

import pandas as pd, numpy as np
df_a = pd.DataFrame(np.arange(1,1000*1000+1).reshape(1000,1000))

%timeit pd.DataFrame(np.nancumsum(df_a.values))    #  4.18 ms
%timeit df_a.cumsum()                              # 15.7  ms

(Timing performed by one of the commenters, since my numpy v1.11 does not have nancumsum.)

Answer

There seem to be a couple of things worth noting here.

First, df_a.cumsum() defaults to axis=0 (Pandas has no concept of summing the whole DataFrame in one call), while the NumPy call defaults to axis=None. So by specifying an axis on one operation and effectively flattening the other, you're comparing apples to oranges.
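A tiny frame makes the default-axis mismatch concrete (a minimal sketch, not the benchmark from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]])

# Pandas: cumsum accumulates down each column (axis=0) by default.
col_wise = df.cumsum()
print(col_wise.values.tolist())          # [[1, 2], [4, 6]]

# NumPy on the raw array: axis=None flattens first, giving a 1-D result.
flat = np.cumsum(df.values)
print(flat.tolist())                     # [1, 3, 6, 10]
```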

That said, there are three calls that you could compare:

>>> np.cumsum(df_a, axis=0)
>>> df_a.cumsum()
>>> val.cumsum(axis=0)  # val = df_a.values

where, in the final call, val is the underlying NumPy array and we don't count retrieval of the .values attribute toward the runtime.
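Outside an IPython shell, the same three-way comparison can be scripted with the standard-library timeit module (a sketch; absolute numbers depend on your machine and library versions):

```python
import timeit

import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.arange(1, 1000 * 1000 + 1).reshape(1000, 1000))
val = df_a.values  # grab the underlying ndarray up front

# Rough stand-ins for the %timeit comparisons; only the relative gap matters.
for label, fn in [
    ("np.cumsum(df_a, axis=0)", lambda: np.cumsum(df_a, axis=0)),
    ("df_a.cumsum()", lambda: df_a.cumsum()),
    ("val.cumsum(axis=0)", lambda: val.cumsum(axis=0)),
]:
    per_call = timeit.timeit(fn, number=20) / 20
    print(f"{label:<26} {per_call * 1e3:.2f} ms")
```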

So, if you're working in IPython shell, give line profiling with %prun a try:

>>> %prun -q -T pdcumsum.txt df_a.cumsum()

>>> val = df_a.values
>>> %prun -q -T ndarraycumsum.txt val.cumsum(axis=0)

>>> %prun -q -T df_npcumsum.txt np.cumsum(df_a, axis=0)

-T saves the output to text so that you can view all three matched up with one another. Here's what you end up with:

  • df_a.cumsum(): 186 function calls, 0.022 seconds. 0.013 of that is spent on numpy.ndarray.cumsum(). (My guess is that if there are no NaNs, then nancumsum() isn't needed, but please don't quote me on that.) Another chunk is spent on copying the array.
  • val.cumsum(axis=0): 5 function calls, 0.020 seconds. No copy is made (although this isn't an in-place operation).
  • np.cumsum(df_a, axis=0): 204 function calls, 0.026 seconds. Suffice it to say that passing a Pandas object to a top-level NumPy function seems to eventually invoke the equivalent method on the Pandas object, which goes through a whole bunch of overhead and then re-calls the NumPy function.
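The round trip described in the last bullet is visible from the return types alone: the top-level NumPy call hands control back to Pandas, while the ndarray method stays entirely in NumPy. A quick check (not part of the original profiling):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2))

# np.cumsum on a DataFrame dispatches to the DataFrame's own method,
# so the result comes back as a DataFrame...
print(type(np.cumsum(df, axis=0)))     # a pandas DataFrame

# ...whereas calling cumsum on the raw ndarray never leaves NumPy.
print(type(df.values.cumsum(axis=0)))  # a numpy ndarray
```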

Now, unlike %timeit, you're only making 1 call here, as you would in %time, so I wouldn't lean too heavily on the relative timing differences with %prun; perhaps comparing the internal function calls is what's useful. But in this case, when you specify the same axis for both, the timing differences aren't actually that drastic, even if the number of calls made by Pandas dwarfs that of NumPy. In other words, in this case the time of all three calls is dominated by np.ndarray.cumsum(), and the ancillary Pandas calls don't eat up much time. There are other instances where the ancillary Pandas calls do eat up a lot more runtime, but this doesn't seem to be one of them.

Big picture--as acknowledged by Wes McKinney,

Fairly simple operations, from indexing to summary statistics, may pass through multiple layers of scaffolding before hitting the lowest tier of computations.

with the tradeoff being flexibility and increased functionality, you could argue.

One last detail: within NumPy, you can avoid a tiny bit of overhead by calling the instance method ndarray.cumsum() rather than the top-level function np.cumsum(), because the latter just ends up routing to the former. But as a wise man once said, premature optimization is the root of all evil.
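That routing is easy to confirm: both spellings produce identical results, so the choice is purely about a sliver of call overhead (a quick sanity check, not a benchmark):

```python
import numpy as np

arr = np.arange(1, 7).reshape(2, 3)

# The top-level function and the instance method agree exactly;
# np.cumsum just forwards to ndarray.cumsum under the hood.
print(np.array_equal(np.cumsum(arr, axis=0), arr.cumsum(axis=0)))  # True
```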

For reference:

>>> pd.__version__, np.__version__
('0.22.0', '1.14.0')
