Pandas mask / where methods versus NumPy np.where
Question
I often use the Pandas `mask` and `where` methods for cleaner logic when updating values in a series conditionally. However, for relatively performance-critical code I notice a significant performance drop relative to `numpy.where`.
While I'm happy to accept this for specific cases, I'm interested to know:
- Do the Pandas `mask`/`where` methods offer any additional functionality, apart from the `inplace`/`errors`/`try_cast` parameters? I understand those 3 parameters but rarely use them. For example, I have no idea what the `level` parameter refers to.
- Is there any non-trivial counter-example where `mask`/`where` outperforms `numpy.where`? If such an example exists, it could influence how I choose appropriate methods going forwards.
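On the first bullet, two extras worth noting that plain `np.where` doesn't offer (a small illustrative sketch, not part of the original question): `where`/`mask` default the replacement `other` to NaN, and both the condition and the replacement may be callables evaluated on the Series.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.2, 0.7, 0.4, 0.9])

# `other` defaults to NaN: entries failing the condition are simply masked out
print(s.where(s <= 0.5))

# cond and other may both be callables, each applied to the Series itself
print(s.mask(lambda x: x > 0.5, lambda x: x * 2))
```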
For reference, here's some benchmarking on Pandas 0.19.2 / Python 3.6.0:
np.random.seed(0)
n = 10000000
df = pd.DataFrame(np.random.random(n))
assert (df[0].mask(df[0] > 0.5, 1).values == np.where(df[0] > 0.5, 1, df[0])).all()
%timeit df[0].mask(df[0] > 0.5, 1) # 145 ms per loop
%timeit np.where(df[0] > 0.5, 1, df[0]) # 113 ms per loop
The performance appears to diverge further for non-scalar values:
%timeit df[0].mask(df[0] > 0.5, df[0]*2) # 338 ms per loop
%timeit np.where(df[0] > 0.5, df[0]*2, df[0]) # 153 ms per loop
Answer
I'm using pandas 0.23.3 and Python 3.6, so I see a real difference in running time only for your second example.
But let's investigate a slightly different version of your second example (so we get `2*df[0]` out of the way). Here is our baseline on my machine:
twice = df[0]*2
mask = df[0] > 0.5
%timeit np.where(mask, twice, df[0])
# 61.4 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df[0].mask(mask, twice)
# 143 ms ± 5.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numpy's version is about 2.3 times faster than pandas'.
So let's profile both functions to see the difference - profiling is a good way to get the big picture when one isn't very familiar with the code base: it is faster than debugging and less error-prone than trying to figure out what's going on just by reading the code.
I'm on Linux and use `perf`. For numpy's version we get (for the listing see Appendix A):
>>> perf record python np_where.py
>>> perf report
Overhead Command Shared Object Symbol
68,50% python multiarray.cpython-36m-x86_64-linux-gnu.so [.] PyArray_Where
8,96% python [unknown] [k] 0xffffffff8140290c
1,57% python mtrand.cpython-36m-x86_64-linux-gnu.so [.] rk_random
As we can see, the lion's share of the time is spent in `PyArray_Where` - about 69%. The unknown symbol is a kernel function (as a matter of fact `clear_page`) - I ran without root privileges, so the symbol is not resolved.
And for pandas we get (see Appendix B for the code):
>>> perf record python pd_mask.py
>>> perf report
Overhead Command Shared Object Symbol
37,12% python interpreter.cpython-36m-x86_64-linux-gnu.so [.] vm_engine_iter_task
23,36% python libc-2.23.so [.] __memmove_ssse3_back
19,78% python [unknown] [k] 0xffffffff8140290c
3,32% python umath.cpython-36m-x86_64-linux-gnu.so [.] DOUBLE_isnan
1,48% python umath.cpython-36m-x86_64-linux-gnu.so [.] BOOL_logical_not
The situation is quite different here:

- pandas doesn't use `PyArray_Where` under the hood - the most prominent time-consumer is `vm_engine_iter_task`, which is numexpr functionality.
- there is some heavy memory-copying going on - `__memmove_ssse3_back` uses about 25% of the time! Probably some of the kernel's functions are also connected to memory accesses.
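Since the hot path here is numexpr, one way to probe its effect is the `compute.use_numexpr` option (a hedged sketch: the option is a real pandas setting, but whether `mask` actually routes through numexpr depends on the pandas version and on numexpr being installed):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.random(100_000))
cond = df[0] > 0.5
twice = df[0] * 2

# Toggle the numexpr-backed evaluation path; the result must be identical
# either way, so only the timing can differ between the two calls.
pd.set_option('compute.use_numexpr', False)
plain = df[0].mask(cond, twice)
pd.set_option('compute.use_numexpr', True)
maybe_ne = df[0].mask(cond, twice)

assert plain.equals(maybe_ne)
```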
Actually, pandas 0.19 used `PyArray_Where` under the hood; for that older version the perf report would look like:
Overhead Command Shared Object Symbol
32,42% python multiarray.so [.] PyArray_Where
30,25% python libc-2.23.so [.] __memmove_ssse3_back
21,31% python [kernel.kallsyms] [k] clear_page
1,72% python [kernel.kallsyms] [k] __schedule
So basically, back then it would use `np.where` under the hood plus some overhead (all the data-copying above, see `__memmove_ssse3_back`).
I see no scenario where pandas could become faster than numpy in pandas 0.19 - it just adds overhead to numpy's functionality. Pandas 0.23.3 is an entirely different story - there the numexpr module is used, and it is very possible that there are scenarios for which pandas' version is (at least slightly) faster.
I'm not sure this memory-copying is really called for/necessary - maybe one could even call it a performance bug - but I just don't know enough to be certain.
We could help pandas not to copy by peeling away some indirections (passing `np.array` instead of `pd.Series`). For example:
%timeit df[0].mask(mask.values, twice.values)
# 75.7 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now, pandas is only about 25% slower. perf says:
Overhead Command Shared Object Symbol
50,81% python interpreter.cpython-36m-x86_64-linux-gnu.so [.] vm_engine_iter_task
14,12% python [unknown] [k] 0xffffffff8140290c
9,93% python libc-2.23.so [.] __memmove_ssse3_back
4,61% python umath.cpython-36m-x86_64-linux-gnu.so [.] DOUBLE_isnan
2,01% python umath.cpython-36m-x86_64-linux-gnu.so [.] BOOL_logical_not
Much less data-copying, but still more than in numpy's version - and that copying is mostly responsible for the overhead.
My main takeaways:

- pandas has the potential to be at least slightly faster than numpy (because it is possible to be faster). However, pandas' somewhat opaque handling of data-copying makes it hard to predict when this potential is overshadowed by (unnecessary) data copying.
- when the performance of `where`/`mask` is the bottleneck, I would use numba/cython to improve it - see my rather naive tries with numba and cython further below.
The idea is to take the `np.where(df[0] > 0.5, df[0]*2, df[0])` version and to eliminate the need to create a temporary - i.e., `df[0]*2`.
As proposed by @max9111, using numba:
import numba as nb
@nb.njit
def nb_where(df):
    n = len(df)
    output = np.empty(n, dtype=np.float64)
    for i in range(n):
        if df[i] > 0.5:
            output[i] = 2.0*df[i]
        else:
            output[i] = df[i]
    return output
assert (np.where(df[0] > 0.5, twice, df[0]) == nb_where(df[0].values)).all()
%timeit np.where(df[0] > 0.5, df[0]*2, df[0])
# 85.1 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit nb_where(df[0].values)
# 17.4 ms ± 673 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This is about a factor 5 faster than numpy's version!
And here is my by far less successful try to improve the performance with the help of Cython:
%%cython -a
cimport numpy as np
import numpy as np
cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
def cy_where(double[::1] df):
    cdef int i
    cdef int n = len(df)
    cdef np.ndarray[np.float64_t] output = np.empty(n, dtype=np.float64)
    for i in range(n):
        if df[i] > 0.5:
            output[i] = 2.0*df[i]
        else:
            output[i] = df[i]
    return output
assert (df[0].mask(df[0] > 0.5, 2*df[0]).values == cy_where(df[0].values)).all()
%timeit cy_where(df[0].values)
# 66.7 ms ± 753 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This gives a 25% speed-up. I'm not sure why Cython is so much slower than numba, though.
Listings:

A: np_where.py:
import pandas as pd
import numpy as np
np.random.seed(0)
n = 10000000
df = pd.DataFrame(np.random.random(n))
twice = df[0]*2
for _ in range(50):
    np.where(df[0] > 0.5, twice, df[0])
B: pd_mask.py:
import pandas as pd
import numpy as np
np.random.seed(0)
n = 10000000
df = pd.DataFrame(np.random.random(n))
twice = df[0]*2
mask = df[0] > 0.5
for _ in range(50):
    df[0].mask(mask, twice)