Why is Cython so much slower than Numba when iterating over NumPy arrays?

Question

When iterating over NumPy arrays, Numba seems dramatically faster than Cython.
What Cython optimizations am I possibly missing?

Here is a simple example:

import numpy as np

def f(arr):
    res = np.zeros(len(arr))

    for i in range(len(arr)):
        res[i] = (arr[i])**2

    return res

arr=np.random.rand(10000)
%timeit f(arr)

Out: 4.81 ms ± 72.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%load_ext cython
%%cython

import numpy as np
cimport numpy as np
cimport cython
from libc.math cimport pow

#@cython.boundscheck(False)
#@cython.wraparound(False)

cpdef f(double[:] arr):
    cdef np.ndarray[dtype=np.double_t, ndim=1] res
    res = np.zeros(len(arr), dtype=np.double)
    cdef double[:] res_view = res
    cdef int i

    for i in range(len(arr)):
        res_view[i] = pow(arr[i], 2)

    return res

arr=np.random.rand(10000)
%timeit f(arr)

Out: 445 µs ± 5.49 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

import numpy as np
import numba as nb

@nb.jit(nb.float64[:](nb.float64[:]))
def f(arr):
    res = np.zeros(len(arr))

    for i in range(len(arr)):
        res[i] = (arr[i])**2

    return res

arr=np.random.rand(10000)
%timeit f(arr)

Out: 9.59 µs ± 98.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In this example, Numba is almost 50 times faster than Cython.
As a Cython beginner, I guess I am missing something.

Of course, in this simple case, using NumPy's vectorized square function would have been far more suitable:

%timeit np.square(arr)

Out:5.75 µs ± 78.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Answer

As @Antonio has pointed out, using pow for a simple multiplication is not very wise and leads to quite an overhead.

Thus, replacing pow(arr[i], 2) with arr[i]*arr[i] leads to a pretty large speed-up:

cython-pow-version        356 µs
numba-version              11 µs
cython-mult-version        14 µs
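
For reference, the multiplication variant could look like the following minimal sketch (same notebook setup as above; cy_mult_f is a name chosen here for illustration):

%%cython

import numpy as np
cimport numpy as np
cimport cython

cpdef cy_mult_f(double[:] arr):
    cdef np.ndarray[dtype=np.double_t, ndim=1] res
    res = np.zeros(len(arr), dtype=np.double)
    cdef double[:] res_view = res
    cdef int i

    for i in range(len(arr)):
        # plain multiplication instead of pow(arr[i], 2)
        res_view[i] = arr[i] * arr[i]

    return res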

The remaining difference is probably due to differences between the compilers and optimization levels (llvm vs MSVC in my case). You might want to use clang to match Numba's performance (see for example this SO-answer).

In order to make the optimization easier for the compiler, you should declare the input as a contiguous array, i.e. double[::1] arr (see this question for why it matters for vectorization), use @cython.boundscheck(False) (use option -a to see that there is less yellow), and add compiler flags (i.e. -O3, -march=native or similar, depending on your compiler, to enable vectorization; watch out for build flags used by default, which can inhibit some optimizations, for example -fwrapv). In the end you might want to write the workhorse loop in C, compile it with the right combination of flags/compiler, and use Cython to wrap it.
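
Putting these points together, such a version might look like the sketch below (a sketch under the stated assumptions, not a measured benchmark; cy_opt_f is a hypothetical name, and the -c=... syntax for passing extra compile args through the %%cython magic assumes a gcc/clang-style compiler):

%%cython -c=-O3 -c=-march=native

import numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef cy_opt_f(double[::1] arr):
    # double[::1] declares the input as C-contiguous
    cdef Py_ssize_t i, n = arr.shape[0]
    res = np.zeros(n, dtype=np.double)
    cdef double[::1] res_view = res

    for i in range(n):
        res_view[i] = arr[i] * arr[i]

    return res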

By the way, typing the function's parameters as nb.float64[:](nb.float64[:]) decreases Numba's performance: it is no longer allowed to assume that the input array is contiguous, which rules out vectorization. Let Numba detect the types (or declare the arrays as contiguous, i.e. nb.float64[::1](nb.float64[::1])), and you will get better performance:

@nb.jit(nopython=True)
def nb_vec_f(arr):
    res = np.zeros(len(arr))

    for i in range(len(arr)):
        res[i] = (arr[i])**2

    return res

which leads to the following improvement:

%timeit f(arr)  # numba version
# 11.4 µs ± 137 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit nb_vec_f(arr)
# 7.03 µs ± 48.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
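
If you prefer an explicit signature, the contiguous variant mentioned above might look like this sketch (nb_sig_f is a name chosen here for illustration):

@nb.jit(nb.float64[::1](nb.float64[::1]), nopython=True)
def nb_sig_f(arr):
    res = np.zeros(len(arr))  # np.zeros returns a C-contiguous array

    for i in range(len(arr)):
        res[i] = (arr[i])**2

    return res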


And as pointed out by @max9111, we don't have to initialize the result array with zeros; we can use np.empty(...) instead of np.zeros(...). This version even beats NumPy's np.square().
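
A sketch of that variant (nb_vec_f_empty is a hypothetical name):

@nb.jit(nopython=True)
def nb_vec_f_empty(arr):
    # np.empty skips the zero-initialization; every element is
    # overwritten in the loop anyway
    res = np.empty(len(arr))

    for i in range(len(arr)):
        res[i] = (arr[i])**2

    return res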

The performances of different approaches on my machine are:

numba+vectorization+empty     3 µs
np.square                     4 µs
numba+vectorization           7 µs
numba missed vectorization   11 µs
cython+mult                  14 µs
cython+pow                  356 µs
