Simple Python Challenge: Fastest Bitwise XOR on Data Buffers


Problem Description


Challenge:

Perform a bitwise XOR on two equal sized buffers. The buffers will be required to be the python str type since this is traditionally the type for data buffers in python. Return the resultant value as a str. Do this as fast as possible.

The inputs are two 1 megabyte (2**20 byte) strings.

The challenge is to substantially beat my inefficient algorithm using python or existing third party python modules (relaxed rules: or create your own module.) Marginal increases are useless.

from os import urandom
from numpy import frombuffer,bitwise_xor,byte

def slow_xor(aa,bb):
    a=frombuffer(aa,dtype=byte)
    b=frombuffer(bb,dtype=byte)
    c=bitwise_xor(a,b)
    r=c.tostring()
    return r

aa=urandom(2**20)
bb=urandom(2**20)

def test_it():
    for x in xrange(1000):
        slow_xor(aa,bb)

Solution

First Try

Using scipy.weave and SSE2 intrinsics gives a marginal improvement. The first invocation is a bit slower since the code needs to be loaded from the disk and cached, subsequent invocations are faster:

import numpy
import time
from os import urandom
from scipy import weave

SIZE = 2**20

def faster_slow_xor(aa,bb):
    b = numpy.fromstring(bb, dtype=numpy.uint64)
    numpy.bitwise_xor(numpy.frombuffer(aa,dtype=numpy.uint64), b, b)
    return b.tostring()

code = """
const __m128i* pa = (__m128i*)a;
const __m128i* pend = (__m128i*)(a + arr_size);
__m128i* pb = (__m128i*)b;
__m128i xmm1, xmm2;
while (pa < pend) {
  xmm1 = _mm_loadu_si128(pa); // must use unaligned access 
  xmm2 = _mm_load_si128(pb); // numpy will align at 16 byte boundaries
  _mm_store_si128(pb, _mm_xor_si128(xmm1, xmm2));
  ++pa;
  ++pb;
}
"""

def inline_xor(aa, bb):
    a = numpy.frombuffer(aa, dtype=numpy.uint64)
    b = numpy.fromstring(bb, dtype=numpy.uint64)
    arr_size = a.shape[0]
    weave.inline(code, ["a", "b", "arr_size"], headers = ['"emmintrin.h"'])
    return b.tostring()
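
Since faster_slow_xor and inline_xor must produce exactly the same bytes as the baseline, a quick sanity check is worthwhile before timing anything. A minimal sketch, assuming the definitions from the question and from the listing above are all in scope:

reference = slow_xor(aa, bb)                # baseline from the question
assert faster_slow_xor(aa, bb) == reference
assert inline_xor(aa, bb) == reference      # first call also pays weave's one-time compile/cache cost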

Second Try

Taking into account the comments, I revisited the code to find out if the copying could be avoided. Turns out I read the documentation of the string object wrong, so here goes my second try:

support = """
#define ALIGNMENT 16
static void memxor(const char* in1, const char* in2, char* out, ssize_t n) {
    const char* end = in1 + n;
    while (in1 < end) {
       *out = *in1 ^ *in2;
       ++in1; 
       ++in2;
       ++out;
    }
}
"""

code2 = """
PyObject* res = PyString_FromStringAndSize(NULL, real_size);

const ssize_t tail = (ssize_t)PyString_AS_STRING(res) % ALIGNMENT;
const ssize_t head = (ALIGNMENT - tail) % ALIGNMENT;

memxor((const char*)a, (const char*)b, PyString_AS_STRING(res), head);

const __m128i* pa = (__m128i*)((char*)a + head);
const __m128i* pend = (__m128i*)((char*)a + real_size - tail);
const __m128i* pb = (__m128i*)((char*)b + head);
__m128i xmm1, xmm2;
__m128i* pc = (__m128i*)(PyString_AS_STRING(res) + head);
while (pa < pend) {
    xmm1 = _mm_loadu_si128(pa);
    xmm2 = _mm_loadu_si128(pb);
    _mm_stream_si128(pc, _mm_xor_si128(xmm1, xmm2));
    ++pa;
    ++pb;
    ++pc;
}
memxor((const char*)pa, (const char*)pb, (char*)pc, tail);
return_val = res;
Py_DECREF(res);
"""

def inline_xor_nocopy(aa, bb):
    real_size = len(aa)
    a = numpy.frombuffer(aa, dtype=numpy.uint64)
    b = numpy.frombuffer(bb, dtype=numpy.uint64)
    return weave.inline(code2, ["a", "b", "real_size"], 
                        headers = ['"emmintrin.h"'], 
                        support_code = support)

The difference is that the string is allocated inside the C code. It cannot be guaranteed to be aligned to a 16-byte boundary as required by the SSE2 instructions, so the unaligned memory regions at the beginning and the end are copied using byte-wise access.
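
The head/tail computation in code2 can be sanity-checked in isolation. A minimal sketch of the same arithmetic in Python, with a purely hypothetical buffer address:

ALIGNMENT = 16
res_addr = 0x7f3a9c2d0a03                   # hypothetical address of the result string's data
tail = res_addr % ALIGNMENT                 # 3 trailing bytes, handled byte-wise by memxor
head = (ALIGNMENT - tail) % ALIGNMENT       # 13 leading bytes, handled byte-wise by memxor
assert (res_addr + head) % ALIGNMENT == 0   # the SSE stores start on an aligned address

Together with real_size being a multiple of 16 (it is 2**20 here), this guarantees that the middle region processed with the 128-bit intrinsics is a whole number of 16-byte blocks.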

The input data is handed in using numpy arrays anyway, because weave insists on copying Python str objects to std::strings. frombuffer doesn't copy, so this is fine, but the memory is not aligned to a 16-byte boundary, so we need to use _mm_loadu_si128 instead of the faster _mm_load_si128.
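
That frombuffer really is a zero-copy view can be checked directly; a minimal sketch, assuming the same aa and the numpy import from the listings above:

a = numpy.frombuffer(aa, dtype=numpy.uint64)
print a.flags.owndata      # False: the array borrows the str's buffer instead of copying it
print a.flags.writeable    # False: Python str objects are immutable
print a.ctypes.data % 16   # often non-zero, which is why _mm_loadu_si128 is needed for a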

Instead of using _mm_store_si128, we use _mm_stream_si128, which will make sure that any writes are streamed to main memory as soon as possible---this way, the output array does not use up valuable cache lines.

Timings

As for the timings: the slow_xor entry in the first edit referred to my improved version (inline bitwise XOR, uint64); I have removed that confusion. slow_xor now refers to the code from the original question. All timings are done for 1000 runs (a minimal harness reproducing this loop is sketched after the list).

  • slow_xor: 1.85s (1x)
  • faster_slow_xor: 1.25s (1.48x)
  • inline_xor: 0.95s (1.95x)
  • inline_xor_nocopy: 0.32s (5.78x)
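
For reference, the 1000-run loop behind these numbers can be reproduced with a small harness. This is only a sketch (the helper below is not part of the original post) and assumes aa, bb and the four functions above are defined:

import time

def time_xor(fn, runs=1000):
    start = time.time()
    for _ in xrange(runs):
        fn(aa, bb)
    return time.time() - start

for fn in (slow_xor, faster_slow_xor, inline_xor, inline_xor_nocopy):
    print fn.__name__, '%.2fs' % time_xor(fn)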

The code was compiled using gcc 4.4.3 and I've verified that the compiler actually uses the SSE instructions.
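
One way to check that the generated code actually contains SSE instructions is to disassemble the shared object that weave builds and look for the corresponding opcodes (pxor for _mm_xor_si128, movdqu for _mm_loadu_si128, movntdq for _mm_stream_si128). A rough sketch; the path to the compiled module is hypothetical and depends on the local weave cache directory:

import subprocess

so_path = '/home/user/.python26_compiled/sc_0123456789abcdef0.so'  # hypothetical weave cache entry

asm = subprocess.Popen(['objdump', '-d', so_path],
                       stdout=subprocess.PIPE).communicate()[0]
for opcode in ('pxor', 'movdqu', 'movntdq'):
    print opcode, opcode in asm  # True if the opcode appears in the disassembly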
