用cython比struct.pack更快 [英] Going faster than struct.pack with cython
问题描述
我想做的要比 struct.pack
好.
Taking a specific case of packing integeres, via the answer to this question, I have the following to pack a list of ints in pack_ints.pyx
:
# cython: language_level=3, boundscheck=False
import cython
@cython.boundscheck(False)
@cython.wraparound(False)
def pack_ints(int_col):
int_buf = bytearray(4*len(int_col))
cdef int[::1] buf_view = memoryview(int_buf).cast('i')
idx: int = 0
for idx in range(len(int_col)):
buf_view[idx] = int_col[idx]
return int_buf
在ipython中使用以下测试代码:
With this test code in ipython:
from struct import pack
import pyximport; pyximport.install(language_level=3)
import pack_ints
amount = 10**7
ints = list(range(amount))
res1 = pack(f'{amount}i', *ints)
res2 = pack_ints.pack_ints(ints)
assert(res1 == res2)
%timeit pack(f'{amount}i', *ints)
%timeit pack_ints.pack_ints(ints)
我得到:
304 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
212 ms ± 6.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
我尝试将 int_buf
键入为 array('b')
,但没有发现任何改进.
I tried to type int_buf
as an array('b')
, but didn't see an improvement.
是否有其他方法可以对此进行改进,或者以其他方式使用cython,以使此操作更快?
Is there any other way to improve upon this, or use cython in a different way, to make this operation faster?
推荐答案
此答案试图给出一个估计,即并行化版本可以提高多少速度.但是,由于此任务是受内存带宽限制的(Python整数对象至少占用32个字节,并且可以分散在内存中,因此会有很多缓存未命中),因此我们不应该期望太多.
This answer tries to give an estimation, how much speed-up a parallelized version can yield. However, because this task is memory-bandwidth bound (Python-integer-objects take at least 32 bytes and can be scattered in memory, so there will be many cache misses) we should not expect much.
第一个问题是如何处理错误(元素不是整数或值太大).我将遵循策略/简化:当对象
The first problem is, how to handle errors (element is not an integer or the value is too large). I will be following strategy/simplification: when object
- 不是整数
- 是负整数,
- 或整数为> = 2 ^ 30
它将被强制转换为一个特殊数字( -1
),该数字表示出现问题.只允许非负整数< 2 ^ 30
使我的生活更轻松,因为我必须重新实现
it will be casted to a special number (-1
) which signals that something went wrong. Allowing only non-negative integers <2^30
makes my life easier, as I have to reimplement PyLong_AsLongAndOverflow
whitout the raising errors and otherwise detecting overflows is often cumbersome (however, see the version at the end of the answer for a more sophisticated approach).
可以找到这里:
struct _longobject {
PyObject_VAR_HEAD
digit ob_digit[1];
};
成员 ob_size
/macro Py_SIZE
告诉我们在整数表示中使用了30位数字(对于负整数,
Member ob_size
/macro Py_SIZE
tells us how many 30-bit digits are used in the representation of the integer(ob_size
is negative for negative integer).
因此,我的简单规则转化为以下C代码(我使用C而不是Cython,因为它是使用Python的C-API的更简单/更自然的方式):
My simple rule thus translates to the following C-code (I use rather C than Cython, as it is a simpler/more natural way of using Python's C-API):
#include <Python.h>
// returns -1 if vv is not an integer,
// negative, or > 2**30-1
int to_int(PyObject *vv){
if (PyLong_Check(vv)) {
PyLongObject * v = (PyLongObject *)vv;
Py_ssize_t i = Py_SIZE(v);
if(i==0){
return 0;
}
if(i==1){//small enought for a digit
return v->ob_digit[0];
}
//negative (i<0) or too big (i>1)
return -1;
}
return -1;
}
现在给定一个列表,我们可以将其转换为与以下使用omp的C函数并行的 int
缓冲区:
Now given a list, we can convert it to an int
-buffer in parallel with the following C-function, which uses omp:
void convert_list(PyListObject *lst, int *output){
Py_ssize_t n = Py_SIZE(lst);
PyObject **data = lst->ob_item;
#pragma omp parallel for
for(Py_ssize_t i=0; i<n; ++i){
output[i] = to_int(data[i]);
}
}
没什么可说的- PyListObject
-API 用于并行访问列表的元素.之所以可以这样做,是因为 to_int
函数中没有引用计数/竞猜条件.
There is not much to say - PyListObject
-API is used to access the elements of the list in parallel. It can be done, because there are no ref counting/racing conditions in to_int
-function.
现在,将其与Cython捆绑在一起:
Now, bundling it all together with Cython:
%%cython -c=-fopenmp --link-args=-fopenmp
import cython
cdef extern from *:
"""
#include <Python.h>
int to_int(PyObject *vv){
... code
}
void convert_list(PyListObject *lst, int *output){
... code
}
"""
void convert_list(list lst, int *output)
@cython.boundscheck(False)
@cython.wraparound(False)
def pack_ints_ead(list int_col):
cdef char[::1] int_buf = bytearray(4*len(int_col))
convert_list(int_col, <int*>(&int_buf[0]))
return int_buf.base
一个重要的细节是: convert_list
一定不要傻瓜(因为它不是)!Omp线程和Python线程(受GIL影响)完全不同.
One important detail is: convert_list
must not be nogil (because it isn't)! Omp threads and Python-threads (which are affected by GIL) are completly different things.
One can (but there is no must) release GIL for omp-operations while using objects with buffer-protocol - because those objects get locked via buffer-protocol and cannot be changed from different Python-threads. A list
has no such locking mechanism and thus, if GIL were released, the list could be changed in another threads and all our pointers could get invalidated.
现在开始计时(列表稍大):
So now to timings (with a slightly bigger list):
amount = 5*10**7
ints = list(range(amount))
%timeit pack(f'{amount}i', *ints)
# 1.51 s ± 38.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pack_ints_DavidW(ints)
# 284 ms ± 3.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pack_ints_ead(ints)
# 177 ms ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
btw关闭 pack_ints_ead
的并行化将导致209毫秒的运行时间.
btw turning off the parallelization for pack_ints_ead
leads to running time of 209 ms.
因此,考虑到ca的适度改进.33%,我会选择功能更强大的DavidW解决方案.
So given the modest improvement of ca. 33%, I would opt for the more robust DavidW's solution.
这里的实现方式与错误值的信号发送方式略有不同:
Here is implementation with a slightly different way of signaling wrong values:
- 不是整数对象会导致
-2147483648
(即0x80000000
)-32位整数可以存储的最小负值. - integers
> = 2147483647
(即> = 0x7fffffff
)将映射到/存储为2147483647
-最大的正数a可以存储32bit-int. - 整数
< =-2147483647
(即< = 0x80000001
)将映射到/存储为-2147483647
- 所有其他整数都映射到其正确值上.
- not an integer object results in
-2147483648
(i.e.0x80000000
) - the smallest negative value a 32bit-int can store. - integers
>=2147483647
(i.e.>=0x7fffffff
) will be mapped to/stored as2147483647
- the biggest positive number a 32bit-int can store. - integers
<=-2147483647
(i.e.<=0x80000001
) will be mapped to/stored as-2147483647
- all other integer are mapped on their correct value.
主要优点是,它可以在较大范围的整数值中正常工作.该算法产生的运行时间几乎与第一个简单版本的运行时间相同(可能会慢2-3%):
The main advantage is, that it works correctly for a larger range of integer-values. This algorithm yields almost the same running time (maybe 2-3% slower) as the first, simple version:
int to_int(PyObject *vv){
if (PyLong_Check(vv)) {
PyLongObject * v = (PyLongObject *)vv;
Py_ssize_t i = Py_SIZE(v);
int sign = i<0 ? -1 : 1;
i = abs(i);
if(i==0){
return 0;
}
if(i==1){//small enought for a digit
return sign*v->ob_digit[0];
}
if(i==2 && (v->ob_digit[1]>>1)==0){
int add = (v->ob_digit[1]&1) << 30;
return sign*(v->ob_digit[0]+add);
}
return sign * 0x7fffffff;
}
return 0x80000000;
}
这篇关于用cython比struct.pack更快的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!