The best way to shift a __m128i?


Question


I need to shift a __m128i variable (say v) left by m bits, in such a way that the bits move through the whole variable (so the resulting variable represents v*2^m). What is the best way to do this?

Note that _mm_slli_epi64 shifts v0 and v1 separately:

r0 := v0 << count
r1 := v1 << count

so the last bits of v0 are lost, but I want those bits moved into r1.

Edit: I'm looking for code faster than this (for m < 64):

r0 = v0 << m;
r1 = v0 >> (64-m);
r1 ^= v1 << m;
r2 = v1 >> (64-m);
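
For reference, here is that scalar code as a self-contained C function (a sketch with assumed names, not from the question). It produces a 192-bit result, with r2 catching the bits shifted out the top, and the ^ is equivalent to | here because the two operands never have set bits in common. It is only valid for 0 < m < 64, since (64 - m) == 64 would be undefined behavior for a 64-bit shift:

#include <stdint.h>

typedef struct { uint64_t r0, r1, r2; } u192;    // hypothetical result type

static u192 shift128_left_ref(uint64_t v0, uint64_t v1, unsigned m)
{
    u192 r;
    r.r0 = v0 << m;                              // low qword
    r.r1 = (v1 << m) ^ (v0 >> (64 - m));         // carry from v0 into v1; ^ same as | here
    r.r2 = v1 >> (64 - m);                       // overflow out of the 128-bit value
    return r;
}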

Solution

For compile-time constant shift counts, you can get fairly good results. Otherwise not really.

This is just an SSE implementation of the r0 / r1 code from your question, since there's no other obvious way to do it. Variable-count shifts are only available for bit-shifts within vector elements, not for byte-shifts of the whole register. So we just carry the low 64 bits up to the high 64 and use a variable-count shift to put them in the right place.

// untested
#include <immintrin.h>

/* some compilers might choke on slli / srli with non-compile-time-constant args
 * gcc generates the   xmm, imm8 form with constants,
 * and generates the   xmm, xmm  form otherwise.  (with movd to get the count into an xmm)
 */

// doesn't optimize for the special-case where count%8 = 0
// could maybe do that in gcc with if (__builtin_constant_p(count)) { if (count % 8 == 0) return ...; }
__m128i mm_bitshift_left(__m128i x, unsigned count)
{
    __m128i carry = _mm_bslli_si128(x, 8);   // old compilers only have the confusingly named _mm_slli_si128 synonym
    if (count >= 64)
        return _mm_slli_epi64(carry, count-64);  // the non-carry part is all zero, so return early
    // else
    carry = _mm_srli_epi64(carry, 64-count);  // After bslli shifted left by 64b

    x = _mm_slli_epi64(x, count);
    return _mm_or_si128(x, carry);
}

__m128i mm_bitshift_left_3(__m128i x) { // by a specific constant, to see inlined constant version
    return mm_bitshift_left(x, 3);
}
// by a specific constant, to see inlined constant version
__m128i mm_bitshift_left_100(__m128i x) { return mm_bitshift_left(x, 100);  }
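
Since the code above is marked untested, a quick way to gain confidence is to check it against the compiler's native 128-bit shift for every count. This is a hypothetical harness (appended to the same file, so mm_bitshift_left is in scope), assuming GCC/Clang's __uint128_t extension and a little-endian x86 target so memcpy lines up the qwords:

#include <stdio.h>
#include <string.h>

int test_mm_bitshift_left(void)
{
    __uint128_t v = ((__uint128_t)0xDEADBEEFCAFEBABEull << 64) | 0x0123456789ABCDEFull;
    for (unsigned count = 0; count < 128; count++) {
        __m128i x;
        memcpy(&x, &v, 16);                      // low 64 bits land in element 0 on x86
        __m128i r = mm_bitshift_left(x, count);
        __uint128_t expect = v << count;         // defined for count < 128
        if (memcmp(&r, &expect, 16) != 0) {
            printf("mismatch at count=%u\n", count);
            return 1;
        }
    }
    return 0;
}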

I thought this was going to be less convenient than it turned out to be. _mm_slli_epi64 works on gcc/clang/icc even when the count is not a compile-time constant (generating a movd from integer reg to xmm reg). There is a _mm_sll_epi64 (__m128i a, __m128i count) (note the lack of i), but at least these days, the i intrinsic can generate either form of psllq.
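For illustration, the no-i form looks like this when used directly (a sketch; shift_lanes_variable is a made-up name). It does the same per-lane shift as _mm_slli_epi64, with the count taken from the bottom 64 bits of a vector, which is exactly what compilers emit for a runtime-variable count anyway:

__m128i shift_lanes_variable(__m128i x, unsigned count)
{
    __m128i vcount = _mm_cvtsi32_si128((int)count);  // movd: count into the low dword, rest zeroed
    return _mm_sll_epi64(x, vcount);                 // psllq  xmm, xmm form
}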


The compile-time-constant count versions are fairly efficient, compiling to 4 instructions (or 5 without AVX):

mm_bitshift_left_3(long long __vector(2)):
        vpslldq xmm1, xmm0, 8
        vpsrlq  xmm1, xmm1, 61
        vpsllq  xmm0, xmm0, 3
        vpor    xmm0, xmm0, xmm1
        ret

Performance:

This has 3 cycle latency (vpslldq(1) -> vpsrlq(1) -> vpor(1)) on Intel SnB/IvB/Haswell, with throughput limited to one per 2 cycles (saturating the vector shift unit on port 0). The byte-shift runs on the shuffle unit, on a different port. Immediate-count vector shifts are all single-uop instructions, so this is only 4 fused-domain uops taking up pipeline space when mixed in with other code. (Variable-count vector shifts are 2 uops with 2-cycle latency, so the variable-count version of this function is worse than it looks from counting instructions.)

Or for counts >= 64:

mm_bitshift_left_100(long long __vector(2)):
        vpslldq xmm0, xmm0, 8
        vpsllq  xmm0, xmm0, 36
        ret


If your shift count is not a compile-time constant, you have to branch on count >= 64 to figure out whether to left- or right-shift the carry. I believe the shift count is interpreted as an unsigned integer, so a negative count is impossible.

It also takes extra instructions to get the int count and 64-count into vector registers. Doing this in a branchless fashion with vector compares and a blend instruction might be possible, but a branch is probably a good idea.
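
For completeness, here is one way such a branchless version could look (a sketch, not from the original answer; it uses a scalar compare broadcast into a mask plus an SSE2 AND/ANDNOT/OR blend, where SSE4.1 _mm_blendv_epi8 could substitute). It leans on the fact that x86 vector shifts produce zero for counts >= 64, so the wrong path's result is already zero or gets masked away:

__m128i mm_bitshift_left_branchless(__m128i x, unsigned count)
{
    __m128i carry = _mm_bslli_si128(x, 8);           // low qword moved up; low qword now zero
    // count < 64 path: (x << count) | (carry >> (64-count)); psrlq by 64 yields 0 when count == 0
    __m128i lo = _mm_or_si128(_mm_sll_epi64(x, _mm_cvtsi32_si128((int)count)),
                              _mm_srl_epi64(carry, _mm_cvtsi32_si128((int)(64 - count))));
    // count >= 64 path: carry << (count-64); count-64 wraps huge (so shifts to 0) when count < 64
    __m128i hi = _mm_sll_epi64(carry, _mm_cvtsi32_si128((int)(count - 64)));
    __m128i ge64 = _mm_set1_epi32(-(int)(count >= 64));  // all-ones if count >= 64, else zero
    return _mm_or_si128(_mm_and_si128(ge64, hi), _mm_andnot_si128(ge64, lo));
}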


The variable-count version for __uint128_t in GP registers looks fairly good; better than the SSE version. Clang does a slightly better job than gcc, emitting fewer mov instructions, but it still uses two cmov instructions for the count >= 64 case. (Because x86 integer shift instructions mask the count, instead of saturating.)

__uint128_t leftshift_int128(__uint128_t x, unsigned count) {
    return x << count;  // undefined if count >= 128
}
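
If the data starts life in a vector register, a round trip through GP registers could still use that version (hypothetical glue, an assumption rather than part of the answer; needs x86-64 and SSE4.1 for _mm_extract_epi64). The movq/pextrq out and the reassembly back in cost several uops, which could easily eat the advantage over the pure-SSE version:

__m128i mm_bitshift_left_gpr(__m128i x, unsigned count)
{
    __uint128_t v = (unsigned long long)_mm_cvtsi128_si64(x)                          // low qword
                  | ((__uint128_t)(unsigned long long)_mm_extract_epi64(x, 1) << 64); // high qword
    v <<= count;                                 // still undefined for count >= 128
    return _mm_set_epi64x((long long)(v >> 64), (long long)v);
}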
