SSE: Byte swapping


Problem description


I would like to translate this code using SSE intrinsics. Any insight?

  for (uint32_t i = 0; i < length; i += 4, src += 4, dest += 4) {
    uint32_t value = *(uint32_t*)src;
    *(uint32_t*)dest = ((value >> 16) & 0xFFFF) | (value << 16);
  }

Solution

pshufb (SSSE3) should be faster than 2 shifts and an OR. Also, a slight modification to the shuffle mask would enable an endian conversion, instead of just a word-swap.

Stealing Paul R's function structure, just replacing the vector intrinsics:

#include <stdint.h>
#include <stddef.h>
#include <tmmintrin.h>   // SSSE3 intrinsics: _mm_shuffle_epi8 is pshufb

void word_swapping_ssse3(uint32_t* dest, const uint32_t* src, size_t count)
{
    size_t i;
    __m128i shufmask =  _mm_set_epi8(13,12, 15,14,  9,8, 11,10,  5,4, 7,6,  1,0, 3,2);
    // _mm_set args go in big-endian order for some reason.                       

    for (i = 0; i + 4 <= count; i += 4)
    {
        __m128i s = _mm_loadu_si128((__m128i*)&src[i]);
        __m128i d = _mm_shuffle_epi8(s, shufmask);
        _mm_storeu_si128((__m128i*)&dest[i], d);
    }
    for ( ; i < count; ++i) // handle residual elements
    {
        uint32_t w = src[i];
        w = (w >> 16) | (w << 16);
        dest[i] = w;
    }
}
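
As noted above, only the shuffle mask needs to change to turn the word swap into a full per-dword byte reverse (endian conversion). A minimal sketch of that variant with the same structure (the name byte_swapping_ssse3 is just illustrative, not part of the original answer):

#include <stdint.h>
#include <stddef.h>
#include <tmmintrin.h>   // SSSE3: _mm_shuffle_epi8 (pshufb)

void byte_swapping_ssse3(uint32_t* dest, const uint32_t* src, size_t count)
{
    size_t i;
    // Reverse all four bytes of each dword instead of swapping 16-bit halves.
    __m128i shufmask = _mm_set_epi8(12,13,14,15,  8,9,10,11,  4,5,6,7,  0,1,2,3);

    for (i = 0; i + 4 <= count; i += 4)
    {
        __m128i s = _mm_loadu_si128((const __m128i*)&src[i]);
        _mm_storeu_si128((__m128i*)&dest[i], _mm_shuffle_epi8(s, shufmask));
    }
    for ( ; i < count; ++i)  // scalar tail: plain 32-bit byte reverse
    {
        uint32_t w = src[i];
        dest[i] = (w >> 24) | ((w >> 8) & 0xFF00u) | ((w << 8) & 0xFF0000u) | (w << 24);
    }
}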

pshufb can have a memory operand, but it has to be the shuffle mask, not the data to be shuffled. So you can't use it as a shuffled-load. :/
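
One consequence is that if you want the mask to come from memory, it has to be a constant the compiler can keep around, e.g. a static 16-byte-aligned array it can hold in a register or feed to pshufb as its memory operand. A rough sketch, using a gcc-style alignment attribute (the names here are illustrative, not from the original answer):

#include <stdint.h>
#include <tmmintrin.h>

// Memory-order bytes of the word-swap mask (byte 0 first); same mask as
// _mm_set_epi8(13,12, 15,14, 9,8, 11,10, 5,4, 7,6, 1,0, 3,2) above.
static const uint8_t wordswap_mask[16] __attribute__((aligned(16))) =
    { 2,3, 0,1,  6,7, 4,5,  10,11, 8,9,  14,15, 12,13 };

static inline __m128i swap_words_128(__m128i v)
{
    return _mm_shuffle_epi8(v, _mm_load_si128((const __m128i*)wordswap_mask));
}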

gcc doesn't generate great code for the loop. The main loop is

# src: r8.  dest: rcx.  count: rax.  shufmask: xmm1
.L16:
        movq    %r9, %rax
.L3:  # first-iteration entry point
        movdqu  (%r8), %xmm0
        leaq    4(%rax), %r9
        addq    $16, %r8
        addq    $16, %rcx
        pshufb  %xmm1, %xmm0
        movups  %xmm0, -16(%rcx)
        cmpq    %rdx, %r9
        jbe     .L16

With all that loop overhead, and needing a separate load and store instruction, throughput will only be 1 shuffle per 2 cycles. (8 uops, since cmp macro-fuses with jbe).

A faster loop would be

  shl $2, %rax  # uint count  ->  byte count
  # check for %rax less than 16 and skip the vector loop
  # cmp / jsomething
  add %rax, %r8  # set up pointers to the end of the array
  add %rax, %rcx
  neg %rax       # and count upwards toward zero
.loop:
  movdqu (%r8, %rax), %xmm0
  pshufb  %xmm1, %xmm0
  movups  %xmm0, (%rcx, %rax)  # IDK why gcc chooses movups for stores.  Shorter encoding?
  add $16, %rax
  jl .loop
  # ...
  # scalar cleanup

movdqu loads can micro-fuse with complex addressing modes, unlike vector ALU ops, so all these instructions are single-uop except the store, I believe.

This should run at 1 cycle per iteration with some unrolling, since add can macro-fuse with jl. So the loop has 5 total uops. 3 of them are load/store ops, which have dedicated ports. Bottlenecks are: pshufb can only run on one execution port on Haswell (SnB/IvB can run pshufb on ports 1&5). One store per cycle (all microarches). And finally, the 4 fused-domain-uops-per-clock limit for Intel CPUs, which should be reachable, barring cache misses, on Nehalem and later (uop loop buffer).
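
In intrinsics, the pointer-to-end / negative-index structure of that loop might look roughly like the sketch below (not from the original answer; whether a compiler actually emits indexed addressing like the asm above is up to it):

#include <stdint.h>
#include <stddef.h>
#include <tmmintrin.h>

void word_swapping_ssse3_negidx(uint32_t* dest, const uint32_t* src, size_t count)
{
    __m128i shufmask = _mm_set_epi8(13,12, 15,14,  9,8, 11,10,  5,4, 7,6,  1,0, 3,2);
    size_t vec = count & ~(size_t)3;              // elements handled by the vector loop
    const char* sEnd = (const char*)(src + vec);  // pointers to the end of the vector part
    char* dEnd = (char*)(dest + vec);

    // Negative byte offset counting upward toward zero, like the asm loop above.
    for (ptrdiff_t off = -(ptrdiff_t)(vec * sizeof(uint32_t)); off != 0; off += 16)
    {
        __m128i s = _mm_loadu_si128((const __m128i*)(sEnd + off));
        _mm_storeu_si128((__m128i*)(dEnd + off), _mm_shuffle_epi8(s, shufmask));
    }
    for (size_t i = vec; i < count; ++i)          // scalar cleanup
        dest[i] = (src[i] >> 16) | (src[i] << 16);
}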

Unrolling would bring the total fused-domain uops per 16B down below 4. Incrementing pointers, instead of using complex addressing modes, would let the stores micro-fuse. (Reducing loop overhead is always good: letting the re-order buffer fill up with future iterations means the CPU has something to do when it hits a mispredict at the end of the loop and returns to other code.)

This is pretty much what you'd get by unrolling the intrinsics loop, as Elalfer rightly suggests. With gcc, try -funroll-loops if that doesn't bloat the code too much.
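
A 2x-unrolled intrinsics version with incremented pointers (rather than indexed addressing modes) might look like the sketch below; it's an illustration of the suggestion above, not code from the answer:

#include <stdint.h>
#include <stddef.h>
#include <tmmintrin.h>

void word_swapping_ssse3_unroll2(uint32_t* dest, const uint32_t* src, size_t count)
{
    __m128i shufmask = _mm_set_epi8(13,12, 15,14,  9,8, 11,10,  5,4, 7,6,  1,0, 3,2);
    size_t i = 0;

    // Two 16B vectors (8 dwords) per iteration; plain pointer increments keep
    // the store addressing modes simple.
    for ( ; i + 8 <= count; i += 8, src += 8, dest += 8)
    {
        __m128i s0 = _mm_loadu_si128((const __m128i*)src);
        __m128i s1 = _mm_loadu_si128((const __m128i*)(src + 4));
        _mm_storeu_si128((__m128i*)dest,       _mm_shuffle_epi8(s0, shufmask));
        _mm_storeu_si128((__m128i*)(dest + 4), _mm_shuffle_epi8(s1, shufmask));
    }
    for ( ; i + 4 <= count; i += 4, src += 4, dest += 4)   // leftover full vector
    {
        __m128i s = _mm_loadu_si128((const __m128i*)src);
        _mm_storeu_si128((__m128i*)dest, _mm_shuffle_epi8(s, shufmask));
    }
    for ( ; i < count; ++i, ++src, ++dest)                 // scalar tail
        *dest = (*src >> 16) | (*src << 16);
}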

BTW, it's probably going to be better to byte-swap while loading or storing, mixed in with other code, rather than converting a buffer as a separate operation.
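
As a sketch of what that might look like (purely illustrative, not from the original answer): do the pshufb right after the load of whatever loop already consumes the data, e.g. summing 32-bit values out of a word-swapped buffer, instead of writing a converted copy first.

#include <stdint.h>
#include <stddef.h>
#include <tmmintrin.h>

// The "consumer" here is just a 32-bit sum; the point is that the shuffle
// rides along with the load instead of needing a separate conversion pass.
static uint32_t sum_word_swapped(const uint32_t* src, size_t count)
{
    const __m128i shufmask =
        _mm_set_epi8(13,12, 15,14,  9,8, 11,10,  5,4, 7,6,  1,0, 3,2);
    __m128i acc = _mm_setzero_si128();
    size_t i;
    for (i = 0; i + 4 <= count; i += 4)
    {
        __m128i s = _mm_loadu_si128((const __m128i*)&src[i]);
        acc = _mm_add_epi32(acc, _mm_shuffle_epi8(s, shufmask));
    }
    // Horizontal sum of the four 32-bit lanes.
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 4));
    uint32_t total = (uint32_t)_mm_cvtsi128_si32(acc);
    for ( ; i < count; ++i)          // scalar tail
        total += (src[i] >> 16) | (src[i] << 16);
    return total;
}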
