将Python序列输入Cython数组(并返回) [英] Typing Python sequence to Cython array (and back)

查看:121
本文介绍了将Python序列输入Cython数组(并返回)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我第一次成功使用Cython来显着加快将半字节从一个整数列表( bytes )打包到另一个整数中的速度(请参阅更快的位级数据打包),例如将两个连续的字节 0x0A 0x0B 打包到 0xAB 中。

I have successfully used Cython for the first time to significantly speed up packing nibbles from one list of integers (bytes) into another (see Faster bit-level data packing), e.g. packing the two sequential bytes 0x0A and 0x0B into 0xAB.

def pack(it):
    """Cythonize python nibble packing loop, typed"""
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    return [ (it[i*2]//16)<<4 | it[i*2+1]//16 for i in range(n) ]

结果速度令人满意,我很好奇是否可以通过更好地利用输入和输出列表来进一步提高此效果。

While the resulting speed is satisfactory, I am curious whether this can be taken further by making better use of the input and output lists.

cython3 -a pack .cyx 生成了一个非常 cythonic的HTML报告,很遗憾,我没有足够的经验来得出任何有用的结论。

cython3 -a pack.cyx generates a very "cythonic" HTML report that I unfortunately am not experienced enough to draw any useful conclusions from.

从C点开始循环应该简单地访问两个无符号的int数组。可能的是,使用更宽的数据类型(16/32位)可以进一步按比例加快此速度。

From a C point of view the loop should "simply" access two unsigned int arrays. Possibly, using a wider data type (16/32 bit) could further speed this up proportionally.

问题是:(如何) Python [binary / immutable]序列类型键入为 unsigned int array for Cython?

The question is: (how) can Python [binary/immutable] sequence types be typed as unsigned int array for Cython?

根据建议使用数组如何将python数组转换为cython数组?似乎没有使其更快(并且需要从字节创建数组)对象),也不会将参数键入 list 而不是 object (与无类型相同)或使用for循环而不是列表理解:

Using array as suggested in How to convert python array to cython array? does not seem to make it faster (and the array needs to be created from bytes object beforehand), nor does typing the parameter as list instead of object (same as no type) or using for loop instead of list comprehension:

def packx(list it):
    """Cythonize python nibble packing loop, typed"""
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    cdef list r = [0]*n
    for i in range(n):
        r[i] = (it[i*2]//16)<<4 | it[i*2+1]//16
    return r






我认为我先前的测试只是将array.array指定为输入,但是按照注释,我现在已经尝试


I think my earlier test just specified an array.array as input, but following the comments I've now just tried

from cpython cimport array
import array

def packa(array.array a):
    """Cythonize python nibble packing loop, typed"""
    cdef unsigned int n = len(a)//2
    cdef unsigned int i
    cdef unsigned int b[256*64/2]
    for i in range(n):
        b[i] = (a[i*2]//16)<<4 | a[i*2+1]//16;

    cdef array.array c = array.array("B", b)
    return c

可编译,但

ima = array.array("B", imd) # unsigned char (1 Byte)
pa = packa(ima)
packed = pa.tolist()

段错误。
我发现文档有点稀疏,因此希望了解这里存在的问题以及如何为输出数据分配数组。

segfaults. I find the documentation a bit sparse, so any hints on what the problem is here and how to allocate the array for output data are appreciated.



采用@ead的第一种方法,再加上除法和移位(似乎可以节省一微秒:


Taking @ead's first approach, plus combining division and shifting (seems to save a microsecond:

#cython: boundscheck=False, wraparound=False

def packa(char[::1] a):
    """Cythonize python nibble packing loop, typed with array"""

    cdef unsigned int n = len(a)//2
    cdef unsigned int i

    # cdef unsigned int b[256*64/2]
    cdef array.array res = array.array('B', [])
    array.resize(res, n)

    for i in range(n):
        res.data.as_chars[i] = ( a[i*2] & 0xF0 ) | (a[i*2+1] >> 4);

    return res

编译时间更长,但运行速度更快:

compiles much longer, but runs much faster:

python3 -m timeit -s 'from pack import packa; import array; data = array.array("B", bytes([0]*256*64))' 'packa(data)'
1000 loops, best of 3: 236 usec per loop

很棒!但是,有了额外的字节到数组和数组到列表的转换

Amazing! But, with the additional bytes-to-array and array-to-list conversion

ima = array.array("B", imd) # unsigned char (1 Byte)
pa = packa(ima)
packed = pa.tolist() # bytes would probably also do

现在只需大约1.7毫秒-非常酷!

it now only takes about 1.7 ms - very cool!

定时降低到大约150 us。实际0.4毫秒:

Down to 150 us timed or approx. 0.4 ms actual:

from cython cimport boundscheck, wraparound
from cpython cimport array
import array

@boundscheck(False)
@wraparound(False)
def pack(const unsigned char[::1] di):
    cdef:
        unsigned int i, n = len(di)
        unsigned char h, l, r
        array.array do = array.array('B')

    array.resize(do, n>>1)

    for i in range(0, n, 2):
        h = di[i] & 0xF0
        l = di[i+1] >> 4
        r = h | l
        do.data.as_uchars[i>>1] = r

    return do

我不再将结果数组转换为列表,这在写时由py-spidev自动完成,总时间大约相同:10 ms(@ 10 MHz)。 / p>

I'm not converting the result array to a list anymore, this is done automatically by py-spidev when writing, and the total time is about the same: 10 ms (@ 10 MHz).

推荐答案

如果您想和C一样快,则不应在列表中使用python-integers,而应使用 array.array 。通过使用cython + array.array ,可能会为您的python + list代码加快140左右的速度。

If you wanna to be as fast as C you should not use list with python-integers inside but an array.array. It is possible to get a speed-up of around 140 for your python+list code by using cython+array.array.

以下是一些如何使用cython加快代码编写速度的想法。作为基准,我选择了一个包含1000个元素的列表(足够大,并且缓存丢失没有影响):

Here are some ideas how to make your code faster with cython. As benchmark I choose a list with 1000 elements (big enough and cache-misses have no effects yet):

import random
l=[random.randint(0,15) for _ in range(1000)]

基线,您的python实现列表:

As baseline, your python-implementation with list:

def packx(it):
    n = len(it)//2
    r = [0]*n
    for i in range(n):
        r[i] = (it[i*2]%16)<<4 | it[i*2+1]%16
    return r

%timeit packx(l)
143 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

顺便说一句,我使用而不是 // ,这可能是您想要的,否则您将只得到 0 s结果(只有低位在您的描述中有数据)。

By the way, I use % instead of //, which is probably what you want, otherwise you would get only 0s as result (only lower bits have data in your description).

在对相同的功能进行cython之后(使用 %% cython -魔术)我们得到的加速比约为2:

After cythonizing the same function (with %%cython-magic) we get a speed-up of around 2:

%timeit packx(l)
77.6 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

看看由选项 -a 产生的html,我们看到与 for -loop对应的行的以下内容:

Let's look at the html produced by option -a, we see the following for the line corresponding to the for-loop:

..... 
__pyx_t_2 = PyNumber_Multiply(__pyx_v_i, __pyx_int_2); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 6, __pyx_L1_error)
 __Pyx_GOTREF(__pyx_t_2);
 __pyx_t_5 = PyObject_GetItem(__pyx_v_it, __pyx_t_2); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 6, __pyx_L1_error)
 __Pyx_GOTREF(__pyx_t_5);
 __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
 __pyx_t_2 = __Pyx_PyInt_RemainderObjC(__pyx_t_5, __pyx_int_16, 16, 0); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 6, __pyx_L1_error)
...

Py_NumberMultiply 表示我们使用慢速python乘法, Pyx_DECREF -所有临时对象都是慢速python对象。我们需要改变它!

Py_NumberMultiply means that we use slow python-multiplication, Pyx_DECREF- all temporaries are slow python-objects. We need to change that!

让我们传递一个列表而不是一个列表,而是将一个 array.array 个字节传递给我们的函数,并返回一个 array.array 个字节。列表中有完整的python对象, array.array 较低的原始c数据,速度更快:

Let's pass not a list but an array.array of bytes to our function and return an array.array of bytes back. Lists have full fledged python objects inside, array.arraythe lowly raw c-data which is faster:

%%cython
from cpython cimport array
def cy_apackx(char[::1] it):
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    cdef array.array res = array.array('b', [])
    array.resize(res, n)
    for i in range(n):
        res.data.as_chars[i] = (it[i*2]%16)<<4 | it[i*2+1]%16
    return res

import array
a=array.array('B', l)
%timeit cy_apackx(a)
19.2 µs ± 316 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

更好,但是让我们看一下生成的html,仍然有一些慢速的python代码:

Better, but let take a look at the generated html, there is still some slow python-code:

 __pyx_t_2 = __Pyx_PyInt_From_long(((__Pyx_mod_long((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_7)) ))), 16) << 4) | __Pyx_mod_long((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_8)) ))), 16))); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 9, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_2);
 if (unlikely(__Pyx_SetItemInt(((PyObject *)__pyx_v_res), __pyx_v_i, __pyx_t_2, unsigned int, 0, __Pyx_PyInt_From_unsigned_int, 0, 0, 1) < 0)) __PYX_ERR(0, 9, __pyx_L1_error)
 __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;

我们仍然对数组使用python-setter( __ Pax_SetItemInt ),为此需要python objecct __ pyx_t_2 ,为避免这种情况,我们使用 array.data.as_chars

We still use a python-setter for array (__Pax_SetItemInt) and for this a python objecct __pyx_t_2 is needed, to avoid this we use array.data.as_chars:

%%cython
from cpython cimport array
def cy_apackx(char[::1] it):
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    cdef array.array res = array.array('B', [])
    array.resize(res, n)
    for i in range(n):
        res.data.as_chars[i] = (it[i*2]%16)<<4 | it[i*2+1]%16  ##HERE!
return res

%timeit cy_apackx(a)
1.86 µs ± 30.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

好多了,但让我们再次看一下html,我们看到对 __ Pyx_RaiseBufferIndexError的一些调用-此安全功能需要花费一些时间,因此我们将其关闭:

Much better, but let's take a look at html again, and we see some calls to __Pyx_RaiseBufferIndexError - this safety costs some time, so let's switch it off:

%%cython
from cpython cimport array    
import cython
@cython.boundscheck(False) # switch of safety-checks
@cython.wraparound(False) # switch of safety-checks
def cy_apackx(char[::1] it):
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    cdef array.array res = array.array('B', [])
    array.resize(res, n)
    for i in range(n):
        res.data.as_chars[i] = (it[i*2]%16)<<4 | it[i*2+1]%16  ##HERE!
    return res

%timeit cy_apackx(a)
1.53 µs ± 11.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

当我们查看生成的html时,会看到:

When we look at the generated html, we see:

__pyx_t_7 = (__pyx_v_i * 2);
__pyx_t_8 = ((__pyx_v_i * 2) + 1);
(__pyx_v_res->data.as_chars[__pyx_v_i]) = ((__Pyx_mod_long((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_7)) ))), 16) << 4) | __Pyx_mod_long((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_8)) ))), 16));

没有python的东西!目前很好。但是,我不确定 __ Pyx_mod_long ,它的定义是:

No python-stuff! Good so far. However, I'm not sure about __Pyx_mod_long, its definition is:

static CYTHON_INLINE long __Pyx_mod_long(long a, long b) {
   long r = a % b;
   r += ((r != 0) & ((r ^ b) < 0)) * b;
   return r;
}

所以C和Python在 mod上有区别为负数,必须考虑在内。尽管是内联的,但此函数定义将阻止C编译器将 a%16 优化为 a& 15 。我们只有正数,所以不需要关心它们,因此我们需要自己做 a& 15 技巧:

So C and Python have differences for mod of negative numbers and it must be taken into account. This function-definition, albeit inlined, will prevent the C-compiler from optimizing a%16 as a&15. We have only positive numbers, so no need to care about them, thus we need to do the a&15-trick by ourselves:

%%cython
from cpython cimport array
import cython
@cython.boundscheck(False)
@cython.wraparound(False)
def cy_apackx(char[::1] it):
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    cdef array.array res = array.array('B', [])
    array.resize(res, n)
    for i in range(n):
        res.data.as_chars[i] = (it[i*2]&15)<<4 | (it[i*2+1]&15)
    return res

%timeit cy_apackx(a)
1.02 µs ± 8.63 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

我也对生成的C代码感到满意/ html(仅一行):

I'm also satified with the resulting C-code/html (only one line):

(__pyx_v_res->data.as_chars[__pyx_v_i]) = ((((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_7)) ))) & 15) << 4) | ((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_8)) ))) & 15));

结论:总而言之,这将加速140(140 µs) vs 1.02 µs)-不错!另一个有趣的点是:计算本身大约需要2 µs(包括少于最佳边界检查和划分的时间)-138 µs用于创建,注册和删除临时python对象。

Conclusion: In the sum that means speed up of 140 (140 µs vs 1.02 µs)- not bad! Another interesting point: the calculation itself takes about 2 µs (and that comprises less than optimal bound checking and division) - 138 µs are for creating, registering and deleting temporary python objects.

如果您需要较高的位并且可以假设较低的位没有污垢(否则& 250 可以帮助您),您可以使用:

If you need the upper bits and can assume that lower bits are without dirt (otherwise &250 can help), you can use:

from cpython cimport array
import cython
@cython.boundscheck(False)
@cython.wraparound(False)
def cy_apackx(char[::1] it):
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    cdef array.array res = array.array('B', [])
    array.resize(res, n)
    for i in range(n):
        res.data.as_chars[i] = it[i*2] | (it[i*2+1]>>4)
    return res

%timeit cy_apackx(a)
819 ns ± 8.24 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)






另一个有趣的问题是,如果使用列表,则哪些成本在进行操作。如果我们从改进版本开始:


Another interesting question is which costs have the operations if list is used. If we start with the "improved" version:

%%cython
def cy_packx(it):
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    res=[0]*n
    for i in range(n):
        res[i] = it[i*2] | (it[i*2+1]>>4))
    return res

%timeit cy_packx(l)
20.7 µs ± 450 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

我们看到,减少了整数的数量操作会大大加快速度。这是由于以下事实:python-integers是不可变的,并且每个操作都会创建一个新的临时对象,这是很昂贵的。

we see, that reducing the number of integer operation leads to a big speed-up. That is due to the fact, that python-integers are immutable and every operation creates a new temporary object, which is costly. Eliminating operations means also eliminating costly temporaries.

但是, it [i * 2] | (it [i * 2 + 1]> 4)是使用python-integer完成的,下一步,我们将其设置为 cdef -操作:

However, it[i*2] | (it[i*2+1]>>4) is done with python-integer, as next step we make it cdef-operations:

%%cython   
def cy_packx(it):
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    cdef unsigned char a,b
    res=[0]*n
    for i in range(n):
        a=it[i*2]
        b=it[i*2+1]  # ensures next operations are fast
        res[i]= a | (b>>4)
    return res   

    %timeit cy_packx(l)
    7.3 µs ± 880 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

我不知道如何进一步改进,因此我们有7.3 µs列表与 array.array 的1 µs对比。

I don't know how it can be improved further, thus we have 7.3 µs for list vs. 1 µs for array.array.

Last问题,清单版本的费用为何?为了避免被C编译器优化,我们使用了稍微不同的基线函数:

Last question, what is the costs break down of the list version? I order to avoid being optimized away by the C-compiler, we use a slightly different baseline function:

%%cython
def cy_packx(it):
        cdef unsigned int n = len(it)//2
        cdef unsigned int i
        cdef unsigned char a,b
        cdef unsigned char s = 0
        res=[0]*n
        for i in range(n):
            a=it[i*2]
            b=it[i*2+1]  # ensures next operations are fast
            s+=a | (b>>4)
            res[i]= s
        return res
%timeit cy_packx(l)

In [79]: %timeit cy_packx(l)
7.67 µs ± 106 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

使用 s 变量意味着,它在第二个版本中并未得到优化:

The usage of the s variable means, it does not get optimized away in the second version:

%%cython   
def cy_packx(it):
        cdef unsigned int n = len(it)//2
        cdef unsigned int i
        cdef unsigned char a,b
        cdef unsigned char s = 0
        res=[0]*n
        for i in range(n):
            a=it[i*2]
            b=it[i*2+1]  # ensures next operations are fast
            s+=a | (b>>4)
        res[0]=s
        return res 

In [81]: %timeit cy_packx(l)
5.46 µs ± 72.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

大约2 µs或大约30%是创建新的整数对象的成本。内存分配的成本是多少?

About 2 µs or about 30% are the costs for creating new integer objects. What are the costs of the memory allocation?

%%cython   
def cy_packx(it):
        cdef unsigned int n = len(it)//2
        cdef unsigned int i
        cdef unsigned char a,b
        cdef unsigned char s = 0
        for i in range(n):
            a=it[i*2]
            b=it[i*2+1]  # ensures next operations are fast
            s+=a | (b>>4)
        return s 

In [84]: %timeit cy_packx(l)
3.84 µs ± 43.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

这会导致列表版本的以下性能崩溃:

That leads to the following performance break down of the list-version:

                    Time(in µs)      Percentage(in %)
     all                7.7                 100
     calculation          1                  12
     alloc memory       1.6                  21
     create ints        2.2                  29
     access data/cast   2.6                  38

我必须承认,我希望 create ints 发挥更大的作用,并且不访问列表中的数据并将其强制转换为 char 会花那么多钱。

I must confess, I expected create ints to play a bigger role and didn't thing accessing the data in the list and casting it to chars will cost that much.

这篇关于将Python序列输入Cython数组(并返回)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆