Why is `vectorize` outperformed by `frompyfunc`?


Question

Numpy offers vectorize and frompyfunc with similar functionalities.

As pointed out in this SO-post, vectorize wraps frompyfunc and handles the type of the returned array correctly, while frompyfunc returns an array of np.object.

However, frompyfunc consistently outperforms vectorize by 10-20% for all sizes, which also cannot be explained by the different return types.

Consider the following variants:

import numpy as np

def do_double(x):
    return 2.0*x

vectorize = np.vectorize(do_double)

frompyfunc = np.frompyfunc(do_double, 1, 1)

def wrapped_frompyfunc(arr):
    return frompyfunc(arr).astype(np.float64)

wrapped_frompyfunc just converts the result of frompyfunc to the right type - as we can see, the cost of this operation is almost negligible.
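A quick way to check that claim is to time the two steps separately; this is only a rough sketch using time.perf_counter rather than the perfplot setup used for the full comparison below:

```python
import time
import numpy as np

def do_double(x):
    return 2.0 * x

frompyfunc = np.frompyfunc(do_double, 1, 1)
arr = np.linspace(0, 1, 10**6)

t0 = time.perf_counter()
obj_result = frompyfunc(arr)               # returns an object array
t1 = time.perf_counter()
converted = obj_result.astype(np.float64)  # the extra step in wrapped_frompyfunc
t2 = time.perf_counter()

print(f"frompyfunc call: {t1 - t0:.3f}s")
print(f"astype(float64): {t2 - t1:.3f}s")  # a small fraction of the total
```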

It results in the following timings (blue line is frompyfunc):

I would expect vectorize to have more overhead - but this should only be visible for small sizes. On the other hand, the conversion from np.object to np.float64 is also done in wrapped_frompyfunc - which is still much faster.

How can this performance difference be explained?

Code to produce the timing comparison using the perfplot package (given the functions above):

import numpy as np
import perfplot
perfplot.show(
    setup=lambda n: np.linspace(0, 1, n),
    n_range=[2**k for k in range(20,27)],
    kernels=[
        frompyfunc, 
        vectorize, 
        wrapped_frompyfunc,
        ],
    labels=["frompyfunc", "vectorize", "wrapped_frompyfunc"],
    logx=True,
    logy=False,
    xlabel='len(x)',
    equality_check = None,  
    )


NB: For smaller sizes, the overhead of vectorize is much higher, but that is to be expected (it wraps frompyfunc after all):

Answer

Following the hints of @hpaulj, we can profile the vectorize function:

arr=np.linspace(0,1,10**7)
%load_ext line_profiler

%lprun -f np.vectorize._vectorize_call \
       -f np.vectorize._get_ufunc_and_otypes  \
       -f np.vectorize.__call__  \
       vectorize(arr)

which shows that 100% of the time is spent in _vectorize_call:

Timer unit: 1e-06 s

Total time: 3.53012 s
File: python3.7/site-packages/numpy/lib/function_base.py
Function: __call__ at line 2063

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  2063                                               def __call__(self, *args, **kwargs):
  ...                                         
  2091         1    3530112.0 3530112.0    100.0          return self._vectorize_call(func=func, args=vargs)

...

Total time: 3.38001 s
File: python3.7/site-packages/numpy/lib/function_base.py
Function: _vectorize_call at line 2154

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  2154                                               def _vectorize_call(self, func, args):
  ...
  2161         1         85.0     85.0      0.0              ufunc, otypes = self._get_ufunc_and_otypes(func=func, args=args)
  2162                                           
  2163                                                       # Convert args to object arrays first
  2164         1          1.0      1.0      0.0              inputs = [array(a, copy=False, subok=True, dtype=object)
  2165         1     117686.0 117686.0      3.5                        for a in args]
  2166                                           
  2167         1    3089595.0 3089595.0     91.4              outputs = ufunc(*inputs)
  2168                                           
  2169         1          4.0      4.0      0.0              if ufunc.nout == 1:
  2170         1     172631.0 172631.0      5.1                  res = array(outputs, copy=False, subok=True, dtype=otypes[0])
  2171                                                       else:
  2172                                                           res = tuple([array(x, copy=False, subok=True, dtype=t)
  2173                                                                        for x, t in zip(outputs, otypes)])
  2174         1          1.0      1.0      0.0          return res

It shows the part I had missed in my assumptions: the double array is converted to an object array entirely in a preprocessing step (which is not a very wise thing to do memory-wise). The other parts are similar for wrapped_frompyfunc:

Timer unit: 1e-06 s

Total time: 3.20055 s
File: <ipython-input-113-66680dac59af>
Function: wrapped_frompyfunc at line 16

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    16                                           def wrapped_frompyfunc(arr):
    17         1    3014961.0 3014961.0     94.2      a = frompyfunc(arr)
    18         1     185587.0 185587.0      5.8      b = a.astype(np.float64)
    19         1          1.0      1.0      0.0      return b
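To get a feel for why converting the whole array to object dtype up front is memory-hungry, here is a rough accounting (the 24-byte size of a boxed float is an assumption for 64-bit CPython; sizes may differ on other builds):

```python
import sys
import numpy as np

n = 10**6
arr = np.linspace(0, 1, n)
obj = arr.astype(object)  # boxes every element as a Python float

# Contiguous float64 storage: 8 bytes per element.
print(arr.nbytes)                  # 8000000

# The object array itself only holds pointers (8 bytes each on 64-bit)...
print(obj.nbytes)
# ...but each pointer refers to a separate boxed Python float
# (typically 24 bytes on 64-bit CPython), so the total footprint
# is roughly 4x that of the plain float64 array.
per_float = sys.getsizeof(obj[0])
print(obj.nbytes + per_float * n)  # roughly 32 MB vs 8 MB
```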

When we take a look at peak memory consumption (e.g. via /usr/bin/time python script.py), we see that the vectorized version has twice the memory consumption of frompyfunc, which uses a more sophisticated strategy: the double array is handled in blocks of size NPY_BUFSIZE (8192), so only 8192 Python floats (24 bytes + 8-byte pointer each) are present in memory at the same time (rather than one per element of the array, which might be much higher). The cost of reserving memory from the OS, plus more cache misses, is probably what leads to the higher running times.
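The block-wise strategy can be imitated in pure Python. This is only an illustrative sketch of the idea, not what NumPy's C machinery actually does:

```python
import numpy as np

NPY_BUFSIZE = 8192  # NumPy's internal buffer size

def do_double(x):
    return 2.0 * x

def buffered_apply(func, arr):
    """Apply func element-wise, boxing only one block's worth of
    Python floats at a time instead of the whole array."""
    out = np.empty_like(arr)
    for start in range(0, len(arr), NPY_BUFSIZE):
        block = arr[start:start + NPY_BUFSIZE]
        # Only ~8192 boxed Python floats are alive at this point.
        out[start:start + NPY_BUFSIZE] = [func(x) for x in block]
    return out

arr = np.linspace(0, 1, 10**5)
res = buffered_apply(do_double, arr)
```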

My takeaways:

  • The preprocessing step, which converts all inputs into object arrays, might not be needed at all, because frompyfunc has an even more sophisticated way of handling those conversions.
  • Neither vectorize nor frompyfunc should be used when the resulting ufunc is to be used in "real code". Instead, one should either write it in C or use numba or similar.

Calling frompyfunc on the object array needs less time than on the double array:

arr=np.linspace(0,1,10**7)
a = arr.astype(np.object)
%timeit frompyfunc(arr)  # 1.08 s ± 65.8 ms
%timeit frompyfunc(a)    # 876 ms ± 5.58 ms

However, the line-profiler timings above do not show any advantage for applying the ufunc to objects rather than doubles: 3.089595 s vs 3.014961 s. My suspicion is that this is due to more cache misses in the case where all objects are created at once, versus only 8192 created objects (256 KB) staying hot in the L2 cache.
