Simple Python Challenge: Fastest Bitwise XOR on Data Buffers


Problem Description

The challenge:

Perform a bitwise XOR on two equal-sized buffers. The buffers are required to be the Python str type, since this is traditionally the type for data buffers in Python. Return the resultant value as a str. Do this as fast as possible.

The inputs are two 1 megabyte (2**20 byte) strings.

The challenge is to substantially beat my inefficient algorithm using Python or existing third-party Python modules (relaxed rules: or create your own module). Marginal increases are useless.
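
For reference, the same operation can be written directly in pure Python. This is only a sketch to pin down the semantics and is not part of the original question; it is far too slow for 1 MiB buffers:

def naive_xor(aa, bb):
    # XOR corresponding bytes of two equal-length str buffers, one byte at a time.
    return ''.join(chr(ord(x) ^ ord(y)) for x, y in zip(aa, bb))

The numpy baseline from the question follows: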

from os import urandom
from numpy import frombuffer,bitwise_xor,byte

def slow_xor(aa, bb):
    # Reinterpret both str buffers as numpy byte arrays without copying.
    a = frombuffer(aa, dtype=byte)
    b = frombuffer(bb, dtype=byte)
    # Element-wise XOR, then convert the result back to a str.
    c = bitwise_xor(a, b)
    r = c.tostring()
    return r

aa=urandom(2**20)
bb=urandom(2**20)

def test_it():
    # Benchmark loop: 1000 XOR passes over the two 1 MiB buffers.
    for x in xrange(1000):
        slow_xor(aa, bb)
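
A minimal way to time the baseline, assuming the setup above (a sketch, not part of the question; the time module is used here instead of timeit for brevity):

import time

start = time.time()
test_it()                                   # 1000 XOR passes over the two 1 MiB buffers
print "slow_xor, 1000 runs: %.2fs" % (time.time() - start)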

Answer

First Try

Using scipy.weave and SSE2 intrinsics gives a marginal improvement. The first invocation is a bit slower since the code needs to be loaded from disk and cached; subsequent invocations are faster:

import numpy
import time
from os import urandom
from scipy import weave

SIZE = 2**20

def faster_slow_xor(aa, bb):
    # fromstring makes a writable copy of bb, which doubles as the output buffer.
    b = numpy.fromstring(bb, dtype=numpy.uint64)
    # XOR eight bytes at a time, writing the result in place into b.
    numpy.bitwise_xor(numpy.frombuffer(aa, dtype=numpy.uint64), b, b)
    return b.tostring()

code = """
const __m128i* pa = (__m128i*)a;
const __m128i* pend = (__m128i*)(a + arr_size);
__m128i* pb = (__m128i*)b;
__m128i xmm1, xmm2;
while (pa < pend) {
  xmm1 = _mm_loadu_si128(pa); // must use unaligned access 
  xmm2 = _mm_load_si128(pb); // numpy will align at 16 byte boundaries
  _mm_store_si128(pb, _mm_xor_si128(xmm1, xmm2));
  ++pa;
  ++pb;
}
"""

def inline_xor(aa, bb):
    # a is a read-only view of aa; b is a fresh, writable copy of bb.
    a = numpy.frombuffer(aa, dtype=numpy.uint64)
    b = numpy.fromstring(bb, dtype=numpy.uint64)
    arr_size = a.shape[0]
    # weave compiles the SSE2 snippet on first use and caches the extension.
    weave.inline(code, ["a", "b", "arr_size"], headers = ['"emmintrin.h"'])
    return b.tostring()
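
As a usage sketch (not from the original answer): the first call to inline_xor triggers the weave compilation, so it is worth warming it up once before timing, and checking that the two variants agree:

aa = urandom(SIZE)
bb = urandom(SIZE)

inline_xor(aa, bb)                          # warm-up: compiles and caches the C extension
assert faster_slow_xor(aa, bb) == inline_xor(aa, bb)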

Second Try

Taking into account the comments, I revisited the code to find out whether the copying could be avoided. It turns out I had read the documentation of the string object wrong, so here is my second try:

support = """
#define ALIGNMENT 16
// Byte-wise XOR, used for the unaligned head and tail regions.
static void memxor(const char* in1, const char* in2, char* out, ssize_t n) {
    const char* end = in1 + n;
    while (in1 < end) {
       *out = *in1 ^ *in2;
       ++in1; 
       ++in2;
       ++out;
    }
}
"""

code2 = """
PyObject* res = PyString_FromStringAndSize(NULL, real_size);

// Choose head/tail so that writes into the result start on a 16-byte boundary.
const ssize_t tail = (ssize_t)PyString_AS_STRING(res) % ALIGNMENT;
const ssize_t head = (ALIGNMENT - tail) % ALIGNMENT;

// XOR the unaligned first `head` bytes one byte at a time.
memxor((const char*)a, (const char*)b, PyString_AS_STRING(res), head);

const __m128i* pa = (__m128i*)((char*)a + head);
const __m128i* pend = (__m128i*)((char*)a + real_size - tail);
const __m128i* pb = (__m128i*)((char*)b + head);
__m128i xmm1, xmm2;
__m128i* pc = (__m128i*)(PyString_AS_STRING(res) + head);
while (pa < pend) {
    xmm1 = _mm_loadu_si128(pa);   // unaligned loads: the inputs need not be 16-byte aligned
    xmm2 = _mm_loadu_si128(pb);
    _mm_stream_si128(pc, _mm_xor_si128(xmm1, xmm2));   // non-temporal store, bypasses the cache
    ++pa;
    ++pb;
    ++pc;
}
// XOR the unaligned last `tail` bytes one byte at a time.
memxor((const char*)pa, (const char*)pb, (char*)pc, tail);
return_val = res;
Py_DECREF(res);
"""

def inline_xor_nocopy(aa, bb):
    real_size = len(aa)
    # Read-only views of the inputs; no copies are made here.
    a = numpy.frombuffer(aa, dtype=numpy.uint64)
    b = numpy.frombuffer(bb, dtype=numpy.uint64)
    # code2 allocates the output str itself and hands it back via return_val.
    return weave.inline(code2, ["a", "b", "real_size"],
                        headers = ['"emmintrin.h"'],
                        support_code = support)
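
A quick sanity check for this variant (a sketch, assuming the definitions from the first listing and slow_xor from the question are still in scope):

aa = urandom(SIZE)
bb = urandom(SIZE)

out = inline_xor_nocopy(aa, bb)
assert isinstance(out, str) and len(out) == len(aa)
assert out == slow_xor(aa, bb)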

The difference is that the string is allocated inside the C code. It cannot be guaranteed to be aligned at a 16-byte boundary as required by the SSE2 instructions, so the unaligned memory regions at the beginning and the end are copied using byte-wise access.
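
The head/tail arithmetic from code2, redone in Python for illustration (a sketch; addr is a made-up address standing in for PyString_AS_STRING(res), not a value from the original answer):

ALIGNMENT = 16
addr, real_size = 0x7f3a00000009, 2**20     # hypothetical output address and buffer size

tail = addr % ALIGNMENT                     # unaligned bytes handled byte-wise at the end
head = (ALIGNMENT - tail) % ALIGNMENT       # unaligned bytes handled byte-wise at the start
middle = real_size - head - tail            # bytes covered by the aligned 16-byte SSE2 stores
assert middle % ALIGNMENT == 0              # holds because real_size is a multiple of 16
print head, middle, tail                    # prints: 7 1048560 9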

The input data is handed in as numpy arrays anyway, because weave insists on copying Python str objects to std::strings. frombuffer doesn't copy, so this is fine, but the memory is not aligned to 16 bytes, so we need to use _mm_loadu_si128 instead of the faster _mm_load_si128.
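
The copy/no-copy distinction can be checked from Python directly (an illustration, not part of the answer): frombuffer returns a read-only view of the str, while fromstring returns a writable copy, which is presumably why the output operand in faster_slow_xor is built with fromstring.

import numpy

s = "0123456789abcdef"
view = numpy.frombuffer(s, dtype=numpy.uint8)    # shares memory with s, read-only
copy = numpy.fromstring(s, dtype=numpy.uint8)    # independent, writable copy

print view.flags.owndata, view.flags.writeable   # False False
print copy.flags.owndata, copy.flags.writeable   # True True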

Instead of _mm_store_si128, we use _mm_stream_si128, which makes sure that any writes are streamed to main memory as soon as possible; this way, the output array does not use up valuable cache lines.

As for the timings, the slow_xor entry in the first edit referred to my improved version (inline bitwise xor, uint64); I have removed that confusion. slow_xor now refers to the code from the original question. All timings are done for 1000 runs.

  • slow_xor: 1.85s (1x)
  • faster_slow_xor: 1.25s (1.48x)
  • inline_xor: 0.95s (1.95x)
  • inline_xor_nocopy: 0.32s (5.78x)

The code was compiled using gcc 4.4.3, and I've verified that the compiler actually uses the SSE instructions.
