Why is `vectorize` outperformed by `frompyfunc`?
Question
Numpy offers vectorize and frompyfunc with similar functionality.
As pointed out in this SO-post, vectorize wraps frompyfunc and handles the type of the returned array correctly, while frompyfunc returns an array of np.object.
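This dtype difference is easy to verify directly; a minimal self-contained sketch:

```python
import numpy as np

def do_double(x):
    return 2.0 * x

arr = np.linspace(0, 1, 10)

# frompyfunc always returns an object array...
res_fpf = np.frompyfunc(do_double, 1, 1)(arr)
# ...while vectorize infers a proper output dtype from a trial call.
res_vec = np.vectorize(do_double)(arr)

print(res_fpf.dtype, res_vec.dtype)  # object float64
```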
However, frompyfunc outperforms vectorize consistently by 10-20% for all sizes, which also cannot be explained by the different return types.
Consider the following variants:
import numpy as np

def do_double(x):
    return 2.0*x

vectorize = np.vectorize(do_double)
frompyfunc = np.frompyfunc(do_double, 1, 1)

def wrapped_frompyfunc(arr):
    return frompyfunc(arr).astype(np.float64)
wrapped_frompyfunc just converts the result of frompyfunc to the right type; as we will see, the cost of this operation is almost negligible.
It results in the following timings (blue line is frompyfunc):
I would expect vectorize to have more overhead, but that should show up only for small sizes. On the other hand, converting np.object to np.float64 is also done in wrapped_frompyfunc, which is still much faster.
How can this performance difference be explained?
Code to produce the timing comparison using the perfplot package (given the functions above):
import numpy as np
import perfplot

perfplot.show(
    setup=lambda n: np.linspace(0, 1, n),
    n_range=[2**k for k in range(20, 27)],
    kernels=[
        frompyfunc,
        vectorize,
        wrapped_frompyfunc,
    ],
    labels=["frompyfunc", "vectorize", "wrapped_frompyfunc"],
    logx=True,
    logy=False,
    xlabel='len(x)',
    equality_check=None,
)
NB: For smaller sizes, the overhead of vectorize is much higher, but that is to be expected (it wraps frompyfunc after all):
Answer
Following the hints of @hpaulj, we can profile the vectorize function:
arr = np.linspace(0, 1, 10**7)
%load_ext line_profiler

%lprun -f np.vectorize._vectorize_call \
       -f np.vectorize._get_ufunc_and_otypes \
       -f np.vectorize.__call__ \
       vectorize(arr)
which shows that 100% of the time is spent in _vectorize_call:
Timer unit: 1e-06 s
Total time: 3.53012 s
File: python3.7/site-packages/numpy/lib/function_base.py
Function: __call__ at line 2063
Line # Hits Time Per Hit % Time Line Contents
==============================================================
2063 def __call__(self, *args, **kwargs):
...
2091 1 3530112.0 3530112.0 100.0 return self._vectorize_call(func=func, args=vargs)
...
Total time: 3.38001 s
File: python3.7/site-packages/numpy/lib/function_base.py
Function: _vectorize_call at line 2154
Line # Hits Time Per Hit % Time Line Contents
==============================================================
2154 def _vectorize_call(self, func, args):
...
2161 1 85.0 85.0 0.0 ufunc, otypes = self._get_ufunc_and_otypes(func=func, args=args)
2162
2163 # Convert args to object arrays first
2164 1 1.0 1.0 0.0 inputs = [array(a, copy=False, subok=True, dtype=object)
2165 1 117686.0 117686.0 3.5 for a in args]
2166
2167 1 3089595.0 3089595.0 91.4 outputs = ufunc(*inputs)
2168
2169 1 4.0 4.0 0.0 if ufunc.nout == 1:
2170 1 172631.0 172631.0 5.1 res = array(outputs, copy=False, subok=True, dtype=otypes[0])
2171 else:
2172 res = tuple([array(x, copy=False, subok=True, dtype=t)
2173 for x, t in zip(outputs, otypes)])
2174 1 1.0 1.0 0.0 return res
It shows the part I had missed in my assumptions: the double array is converted to an object array entirely in a preprocessing step (which is not a very wise thing to do memory-wise). The other parts are similar for wrapped_frompyfunc:
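The preprocessing step can be reproduced in isolation; a small sketch (using astype, which is equivalent to the profiled object-array conversion) showing that every double gets boxed into its own Python object before the ufunc ever runs:

```python
import numpy as np

arr = np.linspace(0, 1, 5)

# Mirror of the profiled preprocessing step: each double becomes
# a separate heap-allocated Python float.
inputs = arr.astype(object)

print(inputs.dtype)     # object
print(type(inputs[0]))  # <class 'float'>
```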
Timer unit: 1e-06 s
Total time: 3.20055 s
File: <ipython-input-113-66680dac59af>
Function: wrapped_frompyfunc at line 16
Line # Hits Time Per Hit % Time Line Contents
==============================================================
16 def wrapped_frompyfunc(arr):
17 1 3014961.0 3014961.0 94.2 a = frompyfunc(arr)
18 1 185587.0 185587.0 5.8 b = a.astype(np.float64)
19 1 1.0 1.0 0.0 return b
When we take a look at peak memory consumption (e.g. via /usr/bin/time python script.py), we see that the vectorized version needs twice the memory of frompyfunc, which uses a more sophisticated strategy: the double array is handled in blocks of size NPY_BUFSIZE (which is 8192), so only 8192 Python floats (24 bytes each + 8-byte pointer) are present in memory at the same time, rather than one per array element, which might be much higher. The cost of reserving memory from the OS plus the additional cache misses is probably what leads to the higher running times.
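The buffered strategy can be imitated by hand (a sketch only, not NumPy's actual implementation): process the array in NPY_BUFSIZE-sized blocks, so that only one block's worth of boxed Python floats is alive at any time.

```python
import numpy as np

NPY_BUFSIZE = 8192  # NumPy's internal buffer size

frompyfunc = np.frompyfunc(lambda x: 2.0 * x, 1, 1)

def chunked_double(arr, blocksize=NPY_BUFSIZE):
    # Only `blocksize` boxed Python floats exist at once, so peak
    # memory stays flat regardless of the array length.
    out = np.empty(len(arr), dtype=np.float64)
    for start in range(0, len(arr), blocksize):
        stop = start + blocksize
        out[start:stop] = frompyfunc(arr[start:stop]).astype(np.float64)
    return out

arr = np.linspace(0, 1, 10**5)
print(np.allclose(chunked_double(arr), 2.0 * arr))  # True
```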
My take-aways:
- The preprocessing step, which converts all inputs into object arrays, might not be needed at all, because frompyfunc has an even more sophisticated way of handling those conversions.
- Neither vectorize nor frompyfunc should be used when the resulting ufunc is to be used in "real code". Instead, one should either write it in C or use numba/similar.
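For this toy function the "real code" alternative is trivial, since 2.0*x is already a compiled ufunc expression; a sketch comparing it against the frompyfunc wrapper (the commented numba variant is an illustrative assumption and requires numba to be installed):

```python
import numpy as np

frompyfunc = np.frompyfunc(lambda x: 2.0 * x, 1, 1)
arr = np.linspace(0, 1, 10**6)

# Native ufunc path: one compiled loop, no per-element Python calls.
res_native = 2.0 * arr
res_pyfunc = frompyfunc(arr).astype(np.float64)

print(np.allclose(res_native, res_pyfunc))  # True

# Hypothetical numba variant (untested here, needs numba):
# from numba import vectorize
# @vectorize(["float64(float64)"])
# def do_double(x):
#     return 2.0 * x
```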
Calling frompyfunc on the object array needs less time than on the double array:
arr = np.linspace(0, 1, 10**7)
a = arr.astype(object)   # np.object is deprecated; plain object works
%timeit frompyfunc(arr)  # 1.08 s ± 65.8 ms
%timeit frompyfunc(a)    # 876 ms ± 5.58 ms
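Absolute timings vary by machine, but the results themselves are identical either way; a sketch verifying that only the speed, not the output, depends on the input dtype:

```python
import numpy as np

frompyfunc = np.frompyfunc(lambda x: 2.0 * x, 1, 1)

arr = np.linspace(0, 1, 10**5)
obj = arr.astype(object)  # pre-boxed Python floats

res_from_doubles = frompyfunc(arr).astype(np.float64)
res_from_objects = frompyfunc(obj).astype(np.float64)

print(np.array_equal(res_from_doubles, res_from_objects))  # True
```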
However, the line-profiler timings above do not show any advantage of using the ufunc on objects rather than doubles: 3.089595 s vs. 3.014961 s. My suspicion is that this is due to more cache misses in the case where all objects are created at once, compared with only 8192 created objects (256 KB) staying hot in the L2 cache.