Generating SIMD instructions from Cython code
Question
I need to get an overview of the performance one can get from using Cython in high-performance numerical code. One of the things I am interested in is finding out whether an optimizing C compiler can vectorize the code generated by Cython. So I decided to write the following small example:
import numpy as np
cimport numpy as np
cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
cpdef int f(np.ndarray[int, ndim = 1] f):
    cdef int array_length = f.shape[0]
    cdef int sum = 0
    cdef int k
    for k in range(array_length):
        sum += f[k]
    return sum
I know that there are NumPy functions that do this job, but I would like to have simple code in order to understand what is possible with Cython. It turns out that the module built with the following setup.py and build command:
from distutils.core import setup
from Cython.Build import cythonize
setup(ext_modules = cythonize("sum.pyx"))
python setup.py build_ext --inplace
generates C code that looks like this for the loop:
for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_1; __pyx_t_2 += 1) {
    __pyx_v_sum = __pyx_v_sum + (*(int *)((char *)
        __pyx_pybuffernd_f.rcbuffer->pybuffer.buf +
        __pyx_t_2 * __pyx_pybuffernd_f.diminfo[0].strides));
}
The main problem with this code is that the compiler does not know at compile time that __pyx_pybuffernd_f.diminfo[0].strides is such that the elements of the array are adjacent in memory. Without that information, the compiler cannot vectorize efficiently.
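The difference can be sketched in plain C. This is only an illustration under stated assumptions: the function names sum_strided and sum_contiguous are hypothetical and are not part of Cython's output. The first function mirrors the generated loop, where each element is reached through a byte stride that is only known at run time. The second bakes the unit stride into the pointer type, so adjacency is visible to the compiler.

```c
#include <stddef.h>

/* Strided sum: like the generated loop, each element is located at a byte
 * offset of k * stride_bytes, and stride_bytes is a run-time value. */
int sum_strided(const char *buf, size_t n, ptrdiff_t stride_bytes)
{
    int sum = 0;
    for (size_t k = 0; k < n; ++k) {
        sum += *(const int *)(buf + (ptrdiff_t)k * stride_bytes);
    }
    return sum;
}

/* Unit-stride sum: indexing an int pointer implies that consecutive
 * elements are adjacent in memory. */
int sum_contiguous(const int *data, size_t n)
{
    int sum = 0;
    for (size_t k = 0; k < n; ++k) {
        sum += data[k];
    }
    return sum;
}
```

Calling sum_strided with a stride of sizeof(int) visits the same elements as sum_contiguous, while a stride of 2 * sizeof(int) visits every other element. The compiler has to generate code that is correct for both cases.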
Is there a way to convey this information to the compiler from Cython?
Answer
You have two problems in your code (use Cython's -a option to make them visible):
- The indexing of the NumPy array isn't efficient
- You have forgotten int in cdef sum = 0
Taking this into account, we get:
cpdef int f(np.ndarray[np.int_t] f): ##HERE
    # np.int (an alias of the builtin int) was removed in NumPy 1.24;
    # np.int_ is the matching dtype
    assert f.dtype == np.int_
    cdef int array_length = f.shape[0]
    cdef int sum = 0  ##HERE
    cdef int k
    for k in range(array_length):
        sum += f[k]
    return sum
For the loop, the following code is generated:
int __pyx_t_5;
int __pyx_t_6;
Py_ssize_t __pyx_t_7;
....
__pyx_t_5 = __pyx_v_array_length;
for (__pyx_t_6 = 0; __pyx_t_6 < __pyx_t_5; __pyx_t_6+=1) {
    __pyx_v_k = __pyx_t_6;
    __pyx_t_7 = __pyx_v_k;
    __pyx_v_sum = (__pyx_v_sum + (*__Pyx_BufPtrStrided1d(__pyx_t_5numpy_int_t *, __pyx_pybuffernd_f.rcbuffer->pybuffer.buf, __pyx_t_7, __pyx_pybuffernd_f.diminfo[0].strides)));
}
This is not that bad, but it is not as easy for the optimizer as ordinary hand-written code. As you have already pointed out, __pyx_pybuffernd_f.diminfo[0].strides isn't known at compile time, and this prevents vectorization.
However, you get better results when using typed memoryviews, i.e.:
cpdef int mf(int[::1] f):
    cdef int array_length = len(f)
    ...
which leads to less opaque C code, one that my compiler, at least, can optimize better:
__pyx_t_2 = __pyx_v_array_length;
for (__pyx_t_3 = 0; __pyx_t_3 < __pyx_t_2; __pyx_t_3+=1) {
    __pyx_v_k = __pyx_t_3;
    __pyx_t_4 = __pyx_v_k;
    __pyx_v_sum = (__pyx_v_sum + (*((int *) ( /* dim=0 */ ((char *) (((int *) __pyx_v_f.data) + __pyx_t_4)) ))));
}
The most crucial thing here is that we make it clear to Cython that the memory is contiguous, i.e. int[::1] as opposed to int[:]. The latter is effectively what is seen for NumPy arrays, for which a possible stride != 1 must be taken into account.
In this case, the Cython-generated C code results in the same assembly as the code I would have written by hand. As crisb has pointed out, adding -march=native would lead to vectorization, but in that case the assembly of the two functions would again be slightly different.
However, in my experience, compilers quite often have problems optimizing loops created by Cython, and/or it is easy to miss a detail that prevents the generation of really good C code. So my strategy for workhorse loops is to write them in plain C and use Cython for wrapping/accessing them. This is often somewhat faster, because one can also use dedicated compiler flags for this code snippet without affecting the whole Cython module.
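A minimal sketch of such a hand-written kernel might look like the following. The file name sum_kernel.c, the header sum_kernel.h, and the function name sum_int_kernel are all hypothetical, and the Cython declaration in the comment is an assumption about how one could wrap it, not something taken from the original code.

```c
/* sum_kernel.c (hypothetical file name): a workhorse loop kept in plain C so
 * that it can be compiled with its own flags, e.g. -O3 -march=native,
 * independently of the Cython module. */
#include <stddef.h>

/* Sums n contiguous ints; the unit stride is part of the signature. */
int sum_int_kernel(const int *data, size_t n)
{
    int sum = 0;
    for (size_t k = 0; k < n; ++k) {
        sum += data[k];
    }
    return sum;
}

/* Assumed Cython side, for an int[::1] memoryview f with at least one
 * element:
 *
 *     cdef extern from "sum_kernel.h":
 *         int sum_int_kernel(const int *data, size_t n)
 *
 *     return sum_int_kernel(&f[0], f.shape[0])
 */
```

Keeping the loop in its own translation unit means the compiler sees exactly the simple loop shown here, without the temporaries that Cython introduces.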