Cython: understanding what the html annotation file has to say?


Question

After compiling the following Cython code, I get the html file that looks like this:

import numpy as np
cimport numpy as np

cpdef my_function(np.ndarray[np.double_t, ndim = 1] array_a,
                  np.ndarray[np.double_t, ndim = 1] array_b,
                  int n_rows,
                  int n_columns):

    array_a[0:-1:n_columns] = 0
    array_a[n_columns - 1:n_rows * n_columns:n_columns] = 0
    array_a[0:n_columns] = 0
    array_a[n_columns* (n_rows - 1):n_rows * n_columns] = 0
    array_b[array_a == 3] = 0

    return array_a, array_b

My question is: why are those operations of my function still yellow? Does this mean that the code is still not as fast as it could be using Cython?

Answer

As you already know, yellow lines mean that some interaction with Python happens, i.e. Python functionality rather than raw C functionality is used. You can look into the produced code to see what happens, and whether it can or should be fixed or avoided.

Not every interaction with python means a (measurable) slowdown.

Let's take a look at this simplified function:

%%cython
cimport numpy as np
def use_slices(np.ndarray[np.double_t] a):
    a[0:len(a)]=0.0

When we look into the produced code, we see (I kept only the important parts):

  __pyx_t_1 = PyObject_Length(((PyObject *)__pyx_v_a)); 
  __pyx_t_2 = PyInt_FromSsize_t(__pyx_t_1); 
  __pyx_t_3 = PySlice_New(__pyx_int_0, __pyx_t_2, Py_None); 
  PyObject_SetItem(((PyObject *)__pyx_v_a)

So basically we get a new slice (which is a numpy array) and then use numpy's functionality (PyObject_SetItem) to set all elements to 0.0, which is C code under the hood.

Let's take a look at a version with a hand-written for loop:

cimport numpy as np
def use_for(np.ndarray[np.double_t] a):
    cdef int i
    for i in range(len(a)):
        a[i]=0.0

It still uses PyObject_Length (because of len) and bounds-checking, but otherwise it is C code. When we compare times:

>>> import numpy as np
>>> a=np.ones((500,))
>>> %timeit use_slices(a)
100000 loops, best of 3: 1.85 µs per loop
>>> %timeit use_for(a)
1000000 loops, best of 3: 1.42 µs per loop

>>> b=np.ones((250000,))
>>> %timeit use_slices(b)
10000 loops, best of 3: 104 µs per loop
>>> %timeit use_for(b)
1000 loops, best of 3: 364 µs per loop

You can see the additional overhead of creating a slice for small sizes, but the additional checks in the for-loop version mean it has more overhead in the long run.
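This tradeoff isn't specific to Cython. Here is a rough pure-Python analogue (using the stdlib array module instead of numpy, so this is only an illustration of the pattern, not the Cython code above): slice assignment does its fill loop in C inside the array object, while an explicit Python loop pays interpreter overhead on every element:

```python
from array import array
from timeit import timeit

def zero_with_slice(a):
    # One slice assignment: the fill runs in C inside the array object
    a[:] = array('d', bytes(8 * len(a)))

def zero_with_loop(a):
    # Explicit loop: per-iteration interpreter overhead
    for i in range(len(a)):
        a[i] = 0.0

big = array('d', [1.0] * 50_000)
t_slice = timeit(lambda: zero_with_slice(big), number=20)
t_loop = timeit(lambda: zero_with_loop(big), number=20)
print(t_slice < t_loop)  # for arrays this large, the slice version wins
```

In CPython the crossover behaves the opposite way to the compiled Cython functions above: the interpreted loop never gets cheap, so the slice wins for all but tiny inputs.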

Let's disable these checks:

%%cython
cimport cython
cimport numpy as np
@cython.boundscheck(False)
@cython.wraparound(False)
def use_for_no_checks(np.ndarray[np.double_t] a):
    cdef int i
    for i in range(len(a)):
        a[i]=0.0

In the produced html we can see that a[i] gets as simple as it gets:

 __pyx_t_3 = __pyx_v_i;
    *__Pyx_BufPtrStrided1d(__pyx_t_5numpy_double_t *, __pyx_pybuffernd_a.rcbuffer->pybuffer.buf, __pyx_t_3, __pyx_pybuffernd_a.diminfo[0].strides) = 0.0;
  }

__Pyx_BufPtrStrided1d(type, buf, i0, s0) is defined as (type)((char*)buf + i0 * s0). And now:
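To see concretely what that macro computes, here is a small stdlib-only sketch of the same address arithmetic (using ctypes and the array module; this only illustrates buf + i * stride, it is not Cython's generated code):

```python
import ctypes
from array import array

a = array('d', [10.0, 20.0, 30.0])
base, _ = a.buffer_info()   # address of the first element
stride = a.itemsize         # 8 bytes per C double

# Same address arithmetic as the macro: (char*)buf + i0 * s0
i = 2
elem = ctypes.c_double.from_address(base + i * stride)
print(elem.value)  # 30.0
```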

>>> %timeit use_for_no_checks(a)
1000000 loops, best of 3: 1.17 µs per loop
>>> %timeit use_for_no_checks(b)
1000 loops, best of 3: 246 µs per loop

We can improve it further by releasing the GIL in the for-loop:

%%cython
cimport cython
cimport numpy as np
@cython.boundscheck(False)
@cython.wraparound(False)
def use_for_no_checks_no_gil(np.ndarray[np.double_t] a):
    cdef int i
    cdef int n=len(a)
    with nogil:
      for i in range(n):
        a[i]=0.0

And now:

>>> %timeit use_for_no_checks_no_gil(a)
1000000 loops, best of 3: 1.07 µs per loop
>>> %timeit use_for_no_checks_no_gil(b)
10000 loops, best of 3: 166 µs per loop

So it is somewhat faster, but still you cannot beat numpy for larger arrays.

I think there are two take-aways:

  1. Cython doesn't transform slices into an access through a for-loop, so Python functionality must be used.
  2. The overhead is small: only numpy functionality is called, and the lion's share of the work is done in numpy code, which cannot be sped up through Cython.


One last try, using the memset function:

%%cython
from libc.string cimport memset
cimport numpy as np
def use_memset(np.ndarray[np.double_t] a):
    memset(&a[0], 0, len(a)*sizeof(np.double_t))

We get:

>>> %timeit use_memset(a)
1000000 loops, best of 3: 821 ns per loop
>>> %timeit use_memset(b)
10000 loops, best of 3: 102 µs per loop

It is also as fast as the numpy-code for large arrays.
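The same trick is reachable from pure Python via ctypes.memset; a minimal stdlib-only sketch (assuming the array owns a contiguous buffer, as stdlib arrays do):

```python
import ctypes
from array import array

def zero_with_memset(a):
    # One libc memset call zeroes the whole contiguous buffer,
    # just like the memset in the Cython version above.
    addr, n = a.buffer_info()   # (address, number of elements)
    ctypes.memset(addr, 0, n * a.itemsize)

a = array('d', [1.0, 2.0, 3.0])
zero_with_memset(a)
print(list(a))  # [0.0, 0.0, 0.0]
```

This works for zeroing because an all-zero byte pattern is the IEEE-754 double 0.0; memset cannot fill with arbitrary double values.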

As DavidW suggested, one could try to use memory-views:

%%cython
cimport numpy as np
def use_slices_memview(double[::1] a):
    a[0:len(a)]=0.0

This leads to slightly faster code for small arrays, and similarly fast code for large arrays (compared to numpy slices):

>>> %timeit use_slices_memview(a)
1000000 loops, best of 3: 1.52 µs per loop

>>> %timeit use_slices_memview(b)
10000 loops, best of 3: 105 µs per loop

That means that the memory-view slices have less overhead than the numpy slices. Here is the produced code:

 __pyx_t_1 = __Pyx_MemoryView_Len(__pyx_v_a); 
  __pyx_t_2.data = __pyx_v_a.data;
  __pyx_t_2.memview = __pyx_v_a.memview;
  __PYX_INC_MEMVIEW(&__pyx_t_2, 0);
  __pyx_t_3 = -1;
  if (unlikely(__pyx_memoryview_slice_memviewslice(
    &__pyx_t_2,
    __pyx_v_a.shape[0], __pyx_v_a.strides[0], __pyx_v_a.suboffsets[0],
    0,
    0,
    &__pyx_t_3,
    0,
    __pyx_t_1,
    0,
    1,
    1,
    0,
    1) < 0))
{
    __PYX_ERR(0, 27, __pyx_L1_error)
}

{
      double __pyx_temp_scalar = 0.0;
      {
          Py_ssize_t __pyx_temp_extent = __pyx_t_2.shape[0];
          Py_ssize_t __pyx_temp_idx;
          double *__pyx_temp_pointer = (double *) __pyx_t_2.data;
          for (__pyx_temp_idx = 0; __pyx_temp_idx < __pyx_temp_extent; __pyx_temp_idx++) {
            *((double *) __pyx_temp_pointer) = __pyx_temp_scalar;
            __pyx_temp_pointer += 1;
          }
      }
  }
  __PYX_XDEC_MEMVIEW(&__pyx_t_2, 1);
  __pyx_t_2.memview = NULL;
  __pyx_t_2.data = NULL;

I think the most important part: this code doesn't create an additional temporary object; it reuses the existing memory view for the slice.
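Python's built-in memoryview behaves the same way. A tiny stdlib-only demonstration (an analogy, not a Cython typed memoryview) that slicing a view copies nothing, so writes go straight through to the original buffer:

```python
data = bytearray(b'\x01' * 8)
view = memoryview(data)

# Slicing the view creates no copy of the underlying buffer:
# assigning through the slice mutates `data` directly.
view[2:6] = b'\x00' * 4

print(data)  # bytearray(b'\x01\x01\x00\x00\x00\x00\x01\x01')
```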

My compiler produces (at least for my machine) slightly faster code if memory views are used; not sure whether it is worth investigating. At first sight, the difference in every iteration step is:

# created code for memview-slices:
*((double *) __pyx_temp_pointer) = __pyx_temp_scalar;
 __pyx_temp_pointer += 1;

#created code for memview-for-loop:
 __pyx_v_i = __pyx_t_3;
 __pyx_t_4 = __pyx_v_i;
 *((double *) ( /* dim=0 */ ((char *) (((double *) data) + __pyx_t_4)) )) = 0.0;

I would expect different compilers to handle this code differently well, but clearly the first version is easier to optimize.

As Behzad Jamali pointed out, there is a difference between double[:] a and double[::1] a. The second version, using slices, is about 20% faster on my machine. The difference is that for the double[::1] version it is known at compile time that the memory accesses will be consecutive, and this can be used for optimization. In the version with double[:], we don't know anything about the stride until runtime.
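The same contiguity distinction can be observed on Python's own memoryview (again a stdlib analogy, not a Cython typed memoryview): a step-1 view is C-contiguous, a strided one is not, and only the former can promise consecutive memory access:

```python
buf = bytearray(8)
contig = memoryview(buf)         # stride 1: C-contiguous, like double[::1]
strided = memoryview(buf)[::2]   # stride 2: not contiguous, like double[:]

print(contig.c_contiguous, strided.c_contiguous)  # True False
```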
