How can I use SIMD to accelerate XOR two blocks of memory?


Problem description


I want to XOR two blocks of memory as quickly as possible. How can I use SIMD to accelerate it?

My original code is below:

#include <stdint.h>   /* uint64_t */

void region_xor_w64(   unsigned char *r1,         /* Region 1 */
                       unsigned char *r2,         /* Region 2 */
                       int nbytes)       /* Number of bytes in region */
{
    uint64_t *l1;
    uint64_t *l2;
    uint64_t *ltop;
    unsigned char *ctop;

    ctop = r1 + nbytes;
    ltop = (uint64_t *) ctop;
    l1 = (uint64_t *) r1;
    l2 = (uint64_t *) r2;

    while (l1 < ltop) {
        *l2 = ((*l1)  ^ (*l2));
        l1++;
        l2++;
    }
}

I wrote one myself, but saw little speed improvement.

#include <emmintrin.h>   /* SSE2 intrinsics */

void region_xor_sse(   unsigned char* dst,
                       unsigned char* src,
                       int block_size){
  const __m128i* wrd_ptr = (__m128i*)src;
  const __m128i* wrd_end = (__m128i*)(src+block_size);
  __m128i* dst_ptr = (__m128i*)dst;

  do{
    __m128i xmm1 = _mm_load_si128(wrd_ptr);
    __m128i xmm2 = _mm_load_si128(dst_ptr);

    xmm2 = _mm_xor_si128(xmm1, xmm2);
    _mm_store_si128(dst_ptr, xmm2);
    ++dst_ptr;
    ++wrd_ptr;
  }while(wrd_ptr < wrd_end);
}
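For what it's worth, _mm_load_si128 and _mm_store_si128 fault if the pointers are not 16-byte aligned, and the do/while loop above also assumes block_size is a nonzero multiple of 16. A minimal harness that satisfies both assumptions might look like this (the use of C11 aligned_alloc and the demo function are my additions, not part of the question):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdlib.h>      /* aligned_alloc, free */
#include <assert.h>

/* The question's routine, reproduced unchanged: it requires both
   pointers to be 16-byte aligned and block_size to be a nonzero
   multiple of 16. */
static void region_xor_sse(unsigned char *dst, unsigned char *src,
                           int block_size)
{
    const __m128i *wrd_ptr = (__m128i *)src;
    const __m128i *wrd_end = (__m128i *)(src + block_size);
    __m128i *dst_ptr = (__m128i *)dst;
    do {
        __m128i xmm1 = _mm_load_si128(wrd_ptr);
        __m128i xmm2 = _mm_load_si128(dst_ptr);
        xmm2 = _mm_xor_si128(xmm1, xmm2);
        _mm_store_si128(dst_ptr, xmm2);
        ++dst_ptr;
        ++wrd_ptr;
    } while (wrd_ptr < wrd_end);
}

/* Hypothetical caller: aligned_alloc guarantees the 16-byte alignment
   the aligned loads demand, and 64 is a multiple of 16. */
static void demo(void)
{
    enum { N = 64 };
    unsigned char *src = aligned_alloc(16, N);
    unsigned char *dst = aligned_alloc(16, N);
    assert(src && dst);
    for (int i = 0; i < N; ++i) { src[i] = (unsigned char)i; dst[i] = 0xFF; }
    region_xor_sse(dst, src, N);
    for (int i = 0; i < N; ++i)
        assert(dst[i] == (unsigned char)(0xFF ^ i));   /* dst ^= src held */
    free(src);
    free(dst);
}
```

Buffers from plain malloc are not guaranteed to be 16-byte aligned, so passing them to this routine may crash even when the XOR logic is correct.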

Solution

The more important question is why you would want to do it manually. Do you have an ancient compiler that you think you can outsmart? Those good old times when you had to write SIMD instructions by hand are over. Today, in 99% of cases the compiler will do the job for you, and chances are that it will do a much better job. Also, don't forget that new architectures come out every once in a while with more and more instruction-set extensions. So ask yourself: do you want to maintain N copies of your implementation, one for each platform? Do you want to constantly test your implementation to make sure it is worth maintaining? Most likely the answer would be no.

The only thing you need to do is write the simplest possible code; the compiler will do the rest. For instance, here is how I would write your function:

void region_xor_w64(unsigned char *r1, unsigned char *r2, unsigned int len)
{
    unsigned int i;
    for (i = 0; i < len; ++i)
        r2[i] = r1[i] ^ r2[i];
}

A bit simpler, isn't it? And guess what: the compiler generates code that performs 128-bit XOR using MOVDQU and PXOR, and the critical path looks like this:

4008a0:       f3 0f 6f 04 06          movdqu xmm0,XMMWORD PTR [rsi+rax*1]
4008a5:       41 83 c0 01             add    r8d,0x1
4008a9:       f3 0f 6f 0c 07          movdqu xmm1,XMMWORD PTR [rdi+rax*1]
4008ae:       66 0f ef c1             pxor   xmm0,xmm1
4008b2:       f3 0f 7f 04 06          movdqu XMMWORD PTR [rsi+rax*1],xmm0
4008b7:       48 83 c0 10             add    rax,0x10
4008bb:       45 39 c1                cmp    r9d,r8d
4008be:       77 e0                   ja     4008a0 <region_xor_w64+0x40>

As @Mysticial has pointed out, the above code uses instructions that support unaligned access, which are slower. If, however, the programmer can correctly assume aligned access, it is possible to let the compiler know about it. For example:

void region_xor_w64(unsigned char * restrict r1,
                    unsigned char * restrict r2,
                    unsigned int len)
{
    unsigned char * restrict p1 = __builtin_assume_aligned(r1, 16);
    unsigned char * restrict p2 = __builtin_assume_aligned(r2, 16);

    unsigned int i;
    for (i = 0; i < len; ++i)
        p2[i] = p1[i] ^ p2[i];
}
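Worth noting: __builtin_assume_aligned is a promise, not a check. If r1 or r2 is not actually 16-byte aligned, the movdqa loads the compiler emits will fault. One way to make the promise true at the call site is C11 alignas; this caller sketch is my addition, not part of the answer:

```c
#include <stdalign.h>    /* alignas */
#include <assert.h>

/* The answer's aligned variant (GCC/Clang builtin). */
static void region_xor_w64(unsigned char * restrict r1,
                           unsigned char * restrict r2,
                           unsigned int len)
{
    unsigned char * restrict p1 = __builtin_assume_aligned(r1, 16);
    unsigned char * restrict p2 = __builtin_assume_aligned(r2, 16);
    unsigned int i;
    for (i = 0; i < len; ++i)
        p2[i] = p1[i] ^ p2[i];
}

/* Caller sketch: alignas(16) makes both arrays genuinely 16-byte
   aligned, so the assume-aligned promise holds. */
static void demo_aligned(void)
{
    alignas(16) unsigned char a[32];
    alignas(16) unsigned char b[32];
    for (unsigned i = 0; i < 32; ++i) { a[i] = (unsigned char)i; b[i] = 0xAA; }
    region_xor_w64(a, b, 32);
    for (unsigned i = 0; i < 32; ++i)
        assert(b[i] == (unsigned char)(0xAA ^ i));   /* b ^= a held */
}
```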

The compiler generates the following for the above C code (notice movdqa):

400880:       66 0f 6f 04 06          movdqa xmm0,XMMWORD PTR [rsi+rax*1]
400885:       41 83 c0 01             add    r8d,0x1
400889:       66 0f ef 04 07          pxor   xmm0,XMMWORD PTR [rdi+rax*1]
40088e:       66 0f 7f 04 06          movdqa XMMWORD PTR [rsi+rax*1],xmm0
400893:       48 83 c0 10             add    rax,0x10
400897:       45 39 c1                cmp    r9d,r8d
40089a:       77 e4                   ja     400880 <region_xor_w64+0x20>

Tomorrow, when I buy myself a laptop with a Haswell CPU, the compiler will generate code for me that uses 256-bit instructions instead of 128-bit ones, from the same source, giving me twice the vector performance. It would do so even if I didn't know that Haswell is capable of it. You, on the other hand, would have to not only know about that feature, but also write another version of your code and spend some time testing it.

By the way, it seems you also have a bug in your implementation, where the code can skip up to 3 remaining bytes in the data vector.
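One way to address that kind of tail bug, sketched by me rather than taken from the answer: XOR full 16-byte blocks with SSE2, then finish the last few bytes with a scalar loop. Using unaligned loads (_mm_loadu_si128) also removes the alignment requirement, at some cost on older CPUs:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>      /* size_t */
#include <assert.h>      /* for quick self-checks by callers */

/* XOR src into dst; n need not be a multiple of 16 and the pointers
   need not be 16-byte aligned. */
static void region_xor_sse_tail(unsigned char *dst,
                                const unsigned char *src, size_t n)
{
    size_t i = 0;
    /* Full 16-byte blocks with SSE2 (unaligned loads/stores). */
    for (; i + 16 <= n; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(dst + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_xor_si128(a, b));
    }
    /* Scalar tail: the last n % 16 bytes a blocks-only loop would skip. */
    for (; i < n; ++i)
        dst[i] ^= src[i];
}
```

With an odd length such as 37 bytes, the vector loop covers bytes 0 to 31 and the scalar loop picks up the remaining 5, so no data is skipped.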

At any rate, I would recommend that you trust your compiler and learn how to verify what it generates (i.e., get familiar with objdump). The next option would be to change the compiler. Only then should you start thinking about writing vector processing instructions manually. Or you gonna have a bad time!

Hope it helps. Good Luck!
