用cython比struct.pack更快 [英] Going faster than struct.pack with cython

查看:90
本文介绍了用cython比struct.pack更快的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想做的要比 struct.pack 好.

通过打包

Taking a specific case of packing integeres, via the answer to this question, I have the following to pack a list of ints in pack_ints.pyx:

# cython: language_level=3, boundscheck=False
import cython

@cython.boundscheck(False)
@cython.wraparound(False)
def pack_ints(int_col):

    int_buf = bytearray(4*len(int_col))
    cdef int[::1] buf_view = memoryview(int_buf).cast('i')

    idx: int = 0
    for idx in range(len(int_col)):
        buf_view[idx] = int_col[idx]


    return int_buf

在ipython中使用以下测试代码:

With this test code in ipython:

from struct import pack 
import pyximport; pyximport.install(language_level=3) 
import pack_ints 

amount = 10**7 
ints = list(range(amount)) 

res1 = pack(f'{amount}i', *ints) 
res2 = pack_ints.pack_ints(ints) 
assert(res1 == res2) 

%timeit pack(f'{amount}i', *ints)  
%timeit pack_ints.pack_ints(ints)      

我得到:

304 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
212 ms ± 6.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

我尝试将 int_buf 键入为 array('b'),但没有发现任何改进.

I tried to type int_buf as an array('b'), but didn't see an improvement.

是否有其他方法可以对此进行改进,或者以其他方式使用cython,以使此操作更快?

Is there any other way to improve upon this, or use cython in a different way, to make this operation faster?

推荐答案

此答案试图给出一个估计,即并行化版本可以提高多少速度.但是,由于此任务是受内存带宽限制的(Python整数对象至少占用32个字节,并且可以分散在内存中,因此会有很多缓存未命中),因此我们不应该期望太多.

This answer tries to give an estimation, how much speed-up a parallelized version can yield. However, because this task is memory-bandwidth bound (Python-integer-objects take at least 32 bytes and can be scattered in memory, so there will be many cache misses) we should not expect much.

第一个问题是如何处理错误(元素不是整数或值太大).我将遵循策略/简化:当对象

The first problem is, how to handle errors (element is not an integer or the value is too large). I will be following strategy/simplification: when object

  • 不是整数
  • 是负整数,
  • 或整数为> = 2 ^ 30

它将被强制转换为一个特殊数字( -1 ),该数字表示出现问题.只允许非负整数< 2 ^ 30 使我的生活更轻松,因为我必须重新实现

it will be casted to a special number (-1) which signals that something went wrong. Allowing only non-negative integers <2^30 makes my life easier, as I have to reimplement PyLong_AsLongAndOverflow whitout the raising errors and otherwise detecting overflows is often cumbersome (however, see the version at the end of the answer for a more sophisticated approach).

可以找到这里:

struct _longobject {
    PyObject_VAR_HEAD
    digit ob_digit[1];
};

成员 ob_size /macro Py_SIZE 告诉我们在整数表示中使用了30位数字(对于负整数, size_code 为负).

Member ob_size/macro Py_SIZE tells us how many 30-bit digits are used in the representation of the integer(ob_size is negative for negative integer).

因此,我的简单规则转化为以下C代码(我使用C而不是Cython,因为它是使用Python的C-API的更简单/更自然的方式):

My simple rule thus translates to the following C-code (I use rather C than Cython, as it is a simpler/more natural way of using Python's C-API):

#include <Python.h>

// returns -1 if vv is not an integer,
//            negative, or > 2**30-1
int to_int(PyObject *vv){ 
   if (PyLong_Check(vv)) {
       PyLongObject * v = (PyLongObject *)vv;
       Py_ssize_t i = Py_SIZE(v);
       if(i==0){
           return 0;
       }
       if(i==1){//small enought for a digit
           return v->ob_digit[0];
       }
       //negative (i<0) or too big (i>1)
       return -1;
   }
   return -1;
}

现在给定一个列表,我们可以将其转换为与以下使用omp的C函数并行的 int 缓冲区:

Now given a list, we can convert it to an int-buffer in parallel with the following C-function, which uses omp:

void convert_list(PyListObject *lst, int *output){
    Py_ssize_t n = Py_SIZE(lst);
    PyObject **data = lst->ob_item;
    #pragma omp parallel for
    for(Py_ssize_t i=0; i<n; ++i){
        output[i] = to_int(data[i]);
    }
}

没什么可说的- PyListObject -API 用于并行访问列表的元素.之所以可以这样做,是因为 to_int 函数中没有引用计数/竞猜条件.

There is not much to say - PyListObject-API is used to access the elements of the list in parallel. It can be done, because there are no ref counting/racing conditions in to_int-function.

现在,将其与Cython捆绑在一起:

Now, bundling it all together with Cython:

%%cython -c=-fopenmp --link-args=-fopenmp
import cython

cdef extern from *:
    """
    #include <Python.h>

    int to_int(PyObject *vv){ 
       ... code
    }

    void convert_list(PyListObject *lst, int *output){
        ... code
    }
    """
    void convert_list(list lst, int *output)

@cython.boundscheck(False)
@cython.wraparound(False)
def pack_ints_ead(list int_col):
    cdef char[::1] int_buf = bytearray(4*len(int_col))
    convert_list(int_col, <int*>(&int_buf[0]))
    return int_buf.base

一个重要的细节是: convert_list 一定不要傻瓜(因为它不是)!Omp线程和Python线程(受GIL影响)完全不同.

One important detail is: convert_list must not be nogil (because it isn't)! Omp threads and Python-threads (which are affected by GIL) are completly different things.

在使用

One can (but there is no must) release GIL for omp-operations while using objects with buffer-protocol - because those objects get locked via buffer-protocol and cannot be changed from different Python-threads. A list has no such locking mechanism and thus, if GIL were released, the list could be changed in another threads and all our pointers could get invalidated.

现在开始计时(列表稍大):

So now to timings (with a slightly bigger list):

amount = 5*10**7 
ints = list(range(amount)) 


%timeit pack(f'{amount}i', *ints)  
# 1.51 s ± 38.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pack_ints_DavidW(ints) 
# 284 ms ± 3.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pack_ints_ead(ints) 
# 177 ms ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

btw关闭 pack_ints_ead 的并行化将导致209毫秒的运行时间.

btw turning off the parallelization for pack_ints_ead leads to running time of 209 ms.

因此,考虑到ca的适度改进.33%,我会选择功能更强大的DavidW解决方案.

So given the modest improvement of ca. 33%, I would opt for the more robust DavidW's solution.

这里的实现方式与错误值的信号发送方式略有不同:

Here is implementation with a slightly different way of signaling wrong values:

  • 不是整数对象会导致 -2147483648 (即 0x80000000 )-32位整数可以存储的最小负值.
  • integers > = 2147483647 (即> = 0x7fffffff )将映射到/存储为 2147483647 -最大的正数a可以存储32bit-int.
  • 整数< =-2147483647 (即< = 0x80000001 )将映射到/存储为 -2147483647
  • 所有其他整数都映射到其正确值上.
  • not an integer object results in -2147483648(i.e. 0x80000000) - the smallest negative value a 32bit-int can store.
  • integers >=2147483647 (i.e. >=0x7fffffff) will be mapped to/stored as 2147483647 - the biggest positive number a 32bit-int can store.
  • integers <=-2147483647 (i.e. <=0x80000001) will be mapped to/stored as -2147483647
  • all other integer are mapped on their correct value.

主要优点是,它可以在较大范围的整数值中正常工作.该算法产生的运行时间几乎与第一个简单版本的运行时间相同(可能会慢2-3%):

The main advantage is, that it works correctly for a larger range of integer-values. This algorithm yields almost the same running time (maybe 2-3% slower) as the first, simple version:

int to_int(PyObject *vv){ 
   if (PyLong_Check(vv)) {
       PyLongObject * v = (PyLongObject *)vv;
       Py_ssize_t i = Py_SIZE(v);
       int sign = i<0 ? -1 : 1;
       i = abs(i);
       if(i==0){
           return 0;
       }
       if(i==1){//small enought for a digit
           return sign*v->ob_digit[0];
       }
       if(i==2 && (v->ob_digit[1]>>1)==0){
           int add = (v->ob_digit[1]&1) << 30;
           return sign*(v->ob_digit[0]+add);
       }
       return sign * 0x7fffffff;
   }
   return 0x80000000;
}

这篇关于用cython比struct.pack更快的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆