Generating SIMD instructions from Cython code

Question

I need an overview of the performance one can get from using Cython in high-performance numerical code. One of the things I am interested in is whether an optimizing C compiler can vectorize code generated by Cython. So I decided to write the following small example:

import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef int f(np.ndarray[int, ndim = 1] f):
    cdef int array_length =  f.shape[0]
    cdef int sum = 0
    cdef int k
    for k in range(array_length):
        sum += f[k]
    return sum

I know that there are NumPy functions that do the job, but I would like a simple piece of code in order to understand what is possible with Cython. It turns out that the code generated with:

from setuptools import setup  # distutils was removed in Python 3.12
from Cython.Build import cythonize

setup(ext_modules = cythonize("sum.pyx"))

python setup.py build_ext --inplace

generates C code that looks like this for the loop:

for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_1; __pyx_t_2 += 1) {
  __pyx_v_sum = __pyx_v_sum + (*(int *)((char *)
    __pyx_pybuffernd_f.rcbuffer->pybuffer.buf +
    __pyx_t_2 * __pyx_pybuffernd_f.diminfo[0].strides));
}

The main problem with this code is that the compiler does not know at compile time that __pyx_pybuffernd_f.diminfo[0].strides equals the element size, i.e. that the elements of the array are adjacent in memory. Without that information, the compiler cannot vectorize efficiently.

Is there a way to give the compiler that information from Cython?

Answer

You have two problems in your code (use the option -a to make them visible):

  1. The indexing of the NumPy array isn't efficient.
  2. You have forgotten int in cdef sum=0.

Taking this into account, we get:

cpdef int f(np.ndarray[np.int_t] f):  ##HERE
    assert f.dtype == np.int_  # the np.int alias was removed in NumPy 1.24
    cdef int array_length =  f.shape[0]
    cdef int sum = 0                  ##HERE
    cdef int k
    for k in range(array_length):
        sum += f[k]
    return sum

For the loop, this results in the following C code:

int __pyx_t_5;
int __pyx_t_6;
Py_ssize_t __pyx_t_7;
....
__pyx_t_5 = __pyx_v_array_length;
for (__pyx_t_6 = 0; __pyx_t_6 < __pyx_t_5; __pyx_t_6+=1) {
   __pyx_v_k = __pyx_t_6;
   __pyx_t_7 = __pyx_v_k;
   __pyx_v_sum = (__pyx_v_sum + (*__Pyx_BufPtrStrided1d(__pyx_t_5numpy_int_t *, __pyx_pybuffernd_f.rcbuffer->pybuffer.buf, __pyx_t_7, __pyx_pybuffernd_f.diminfo[0].strides)));

}

This is not that bad, but it is not as easy for the optimizer as normal code written by a human. As you have already pointed out, __pyx_pybuffernd_f.diminfo[0].strides isn't known at compile time, and this prevents vectorization.
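To make the role of the stride concrete, here is a small pure-Python sketch of the address arithmetic performed by the generated loop (base pointer plus k times the stride in bytes). The helper name strided_sum and the sample data are assumptions made for illustration only; it uses the standard library rather than Cython or NumPy.

```python
import struct
from array import array


def strided_sum(buf, length, stride):
    """Sum `length` C ints read from `buf` at byte offsets k * stride."""
    total = 0
    for k in range(length):
        # Same arithmetic as the generated C: (char *)buf + k * stride
        (value,) = struct.unpack_from("i", buf, k * stride)
        total += value
    return total


data = array("i", range(8))  # the ints 0..7 stored back to back
itemsize = data.itemsize     # sizeof(int) on this platform
raw = data.tobytes()

# stride == sizeof(int): consecutive elements, the case SIMD loads need
contiguous = strided_sum(raw, 8, itemsize)

# stride == 2 * sizeof(int): every other element, same loop body
every_other = strided_sum(raw, 4, 2 * itemsize)
```

The loop body is identical in both calls; only the runtime value of stride differs. A C compiler that sees the stride as an unknown variable has to allow for both cases.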

However, you get better results when using typed memory views, i.e.:

cpdef int mf(int[::1] f):
    cdef int array_length =  len(f)
...

which leads to less opaque C code, one that at least my compiler can optimize better:

  __pyx_t_2 = __pyx_v_array_length;
  for (__pyx_t_3 = 0; __pyx_t_3 < __pyx_t_2; __pyx_t_3+=1) {
    __pyx_v_k = __pyx_t_3;
    __pyx_t_4 = __pyx_v_k;
    __pyx_v_sum = (__pyx_v_sum + (*((int *) ( /* dim=0 */ ((char *) (((int *) __pyx_v_f.data) + __pyx_t_4)) ))));
  }

The most crucial thing here is that we make it clear to Cython that the memory is contiguous, i.e. we declare int[::1] rather than int[:]. The latter is how NumPy arrays are treated, and for it a possible stride != 1 must be taken into account.
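The difference between a contiguous buffer and a strided view can be inspected from plain Python. The sketch below uses the standard-library memoryview and array types as stand-ins for a NumPy array (an assumption made to keep it dependency-free); the variable names are illustrative.

```python
from array import array

# A buffer of C ints stored back to back.
values = memoryview(array("i", range(10)))

# A view of every other element: same memory, no copy, larger stride.
every_other = values[::2]

print(values.strides, values.c_contiguous)            # stride is one item
print(every_other.strides, every_other.c_contiguous)  # stride is two items
```

To my understanding, a parameter declared as int[::1] accepts only buffers of the first kind and rejects the second at call time, which is what allows the generated C code to index the data directly. A parameter declared as int[:] accepts both and therefore has to carry the stride at run time.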

In this case, the Cython-generated C code results in the same assembly as the code I would have written by hand. As crisb has pointed out, adding -march=native would lead to vectorization, but in that case the assembly of the two functions would again be slightly different.
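One way to pass such flags is through the extension definition in setup.py. The following is a sketch under several assumptions: the compiler is GCC or Clang (the flags -O3 and -march=native are theirs, not MSVC's), the module is the sum.pyx from the question, the extension name "sum" is chosen to match it, and setuptools, Cython and NumPy are installed.

```python
# setup.py: a sketch, assuming GCC or Clang and the sum.pyx from the question
from setuptools import Extension, setup
from Cython.Build import cythonize
import numpy as np

extensions = [
    Extension(
        "sum",                            # assumed module name
        sources=["sum.pyx"],
        include_dirs=[np.get_include()],  # needed because of "cimport numpy"
        # Flags for GCC/Clang; -march=native targets the build machine's CPU
        extra_compile_args=["-O3", "-march=native"],
    )
]

# annotate=True writes the HTML report that the -a option produces
setup(ext_modules=cythonize(extensions, annotate=True))
```

Note that binaries built with -march=native may not run on machines with an older CPU than the one used for the build.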

However, in my experience, compilers quite often have problems optimizing loops created by Cython, and/or it is easy to miss a detail that prevents the generation of really good C code. So my strategy for workhorse loops is to write them in plain C and use Cython for wrapping/accessing them. This is often somewhat faster, because one can also use dedicated compiler flags for this code snippet without affecting the whole Cython module.
