The best way to shift a __m128i?


Question


I need to shift a __m128i variable (say v) left by m bits, in such a way that the bits move through the whole variable (so the resulting variable represents v*2^m). What is the best way to do this?

Note that _mm_slli_epi64 shifts v0 and v1 separately:

r0 := v0 << count
r1 := v1 << count

so the last bits of v0 are lost, but I want those bits moved into r1.

Edit: I'm looking for code faster than this (for m < 64):

r0 = v0 << m;
r1 = v0 >> (64-m);
r1 ^= v1 << m;
r2 = v1 >> (64-m);
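
For reference, here is that scalar code as a self-contained C function (a sketch with assumed names, not from the question). It produces a 192-bit result, with r2 catching the bits shifted out the top, and the ^ is equivalent to | here because the two operands never have set bits in common. It is only valid for 0 < m < 64, since (64 - m) == 64 would be undefined behavior for a 64-bit shift:

#include <stdint.h>

typedef struct { uint64_t r0, r1, r2; } u192;    // hypothetical result type

static u192 shift128_left_ref(uint64_t v0, uint64_t v1, unsigned m)
{
    u192 r;
    r.r0 = v0 << m;                              // low qword
    r.r1 = (v1 << m) ^ (v0 >> (64 - m));         // carry from v0 into v1; ^ same as | here
    r.r2 = v1 >> (64 - m);                       // overflow out of the 128-bit value
    return r;
}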

Solution

For compile-time constant shift counts, you can get fairly good results. Otherwise not really.

This is just an SSE implementation of the r0 / r1 code from your question, since there's no other obvious way to do it. Variable-count shifts are only available for bit-shifts within vector elements, not for byte-shifts of the whole register. So we just carry the low 64 bits up to the high 64 and use a variable-count shift to put them in the right place.

// untested
#include <immintrin.h>

/* some compilers might choke on slli / srli with non-compile-time-constant args
 * gcc generates the   xmm, imm8 form with constants,
 * and generates the   xmm, xmm  form otherwise.  (with movd to get the count into an xmm)
 */

// doesn't optimize for the special-case where count%8 = 0
// could maybe do that in gcc with if (__builtin_constant_p(count)) { if (count % 8 == 0) return ...; }
__m128i mm_bitshift_left(__m128i x, unsigned count)
{
    __m128i carry = _mm_bslli_si128(x, 8);   // old compilers only have the confusingly named _mm_slli_si128 synonym
    if (count >= 64)
        return _mm_slli_epi64(carry, count-64);  // the non-carry part is all zero, so return early
    // else
    carry = _mm_srli_epi64(carry, 64-count);  // After bslli shifted left by 64b

    x = _mm_slli_epi64(x, count);
    return _mm_or_si128(x, carry);
}

__m128i mm_bitshift_left_3(__m128i x) { // by a specific constant, to see inlined constant version
    return mm_bitshift_left(x, 3);
}
// by a specific constant, to see inlined constant version
__m128i mm_bitshift_left_100(__m128i x) { return mm_bitshift_left(x, 100);  }
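
Since the code above is marked untested, a quick way to gain confidence is to check it against the compiler's native 128-bit shift for every count. This is a hypothetical harness (appended to the same file, so mm_bitshift_left is in scope), assuming GCC/Clang's __uint128_t extension and a little-endian x86 target so memcpy lines up the qwords:

#include <stdio.h>
#include <string.h>

int test_mm_bitshift_left(void)
{
    __uint128_t v = ((__uint128_t)0xDEADBEEFCAFEBABEull << 64) | 0x0123456789ABCDEFull;
    for (unsigned count = 0; count < 128; count++) {
        __m128i x;
        memcpy(&x, &v, 16);                      // low 64 bits land in element 0 on x86
        __m128i r = mm_bitshift_left(x, count);
        __uint128_t expect = v << count;         // defined for count < 128
        if (memcmp(&r, &expect, 16) != 0) {
            printf("mismatch at count=%u\n", count);
            return 1;
        }
    }
    return 0;
}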

I thought this was going to be less convenient than it turned out to be. _mm_slli_epi64 works on gcc/clang/icc even when the count is not a compile-time constant (generating a movd from integer reg to xmm reg). There is a _mm_sll_epi64 (__m128i a, __m128i count) (note the lack of i), but at least these days, the i intrinsic can generate either form of psllq.
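For illustration, the no-i form looks like this when used directly (a sketch; shift_lanes_variable is a made-up name). It does the same per-lane shift as _mm_slli_epi64, with the count taken from the bottom 64 bits of a vector, which is exactly what compilers emit for a runtime-variable count anyway:

__m128i shift_lanes_variable(__m128i x, unsigned count)
{
    __m128i vcount = _mm_cvtsi32_si128((int)count);  // movd: count into the low dword, rest zeroed
    return _mm_sll_epi64(x, vcount);                 // psllq  xmm, xmm form
}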


The compile-time-constant count versions are fairly efficient, compiling to 4 instructions (or 5 without AVX):

mm_bitshift_left_3(long long __vector(2)):
        vpslldq xmm1, xmm0, 8
        vpsrlq  xmm1, xmm1, 61
        vpsllq  xmm0, xmm0, 3
        vpor    xmm0, xmm0, xmm1
        ret

Performance:

This has 3 cycle latency (vpslldq(1) -> vpsrlq(1) -> vpor(1)) on Intel SnB/IvB/Haswell, with throughput limited to one per 2 cycles (saturating the vector shift unit on port 0). The byte-shift runs on the shuffle unit, on a different port. Immediate-count vector shifts are all single-uop instructions, so this is only 4 fused-domain uops taking up pipeline space when mixed in with other code. (Variable-count vector shifts are 2 uops with 2-cycle latency, so the variable-count version of this function is worse than it looks from counting instructions.)

Or for counts >= 64:

mm_bitshift_left_100(long long __vector(2)):
        vpslldq xmm0, xmm0, 8
        vpsllq  xmm0, xmm0, 36
        ret


If your shift count is not a compile-time constant, you have to branch on count >= 64 to figure out whether to left- or right-shift the carry. I believe the shift count is interpreted as an unsigned integer, so a negative count is impossible.

It also takes extra instructions to get the int count and 64-count into vector registers. Doing this in a branchless fashion with vector compares and a blend instruction might be possible, but a branch is probably a good idea.
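
For completeness, here is one way such a branchless version could look (a sketch, not from the original answer; it uses a scalar compare broadcast into a mask plus an SSE2 AND/ANDNOT/OR blend, where SSE4.1 _mm_blendv_epi8 could substitute). It leans on the fact that x86 vector shifts produce zero for counts >= 64, so the wrong path's result is already zero or gets masked away:

__m128i mm_bitshift_left_branchless(__m128i x, unsigned count)
{
    __m128i carry = _mm_bslli_si128(x, 8);           // low qword moved up; low qword now zero
    // count < 64 path: (x << count) | (carry >> (64-count)); psrlq by 64 yields 0 when count == 0
    __m128i lo = _mm_or_si128(_mm_sll_epi64(x, _mm_cvtsi32_si128((int)count)),
                              _mm_srl_epi64(carry, _mm_cvtsi32_si128((int)(64 - count))));
    // count >= 64 path: carry << (count-64); count-64 wraps huge (so shifts to 0) when count < 64
    __m128i hi = _mm_sll_epi64(carry, _mm_cvtsi32_si128((int)(count - 64)));
    __m128i ge64 = _mm_set1_epi32(-(int)(count >= 64));  // all-ones if count >= 64, else zero
    return _mm_or_si128(_mm_and_si128(ge64, hi), _mm_andnot_si128(ge64, lo));
}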


The variable-count version for __uint128_t in GP registers looks fairly good; better than the SSE version. Clang does a slightly better job than gcc, emitting fewer mov instructions, but it still uses two cmov instructions for the count >= 64 case. (Because x86 integer shift instructions mask the count, instead of saturating.)

__uint128_t leftshift_int128(__uint128_t x, unsigned count) {
    return x << count;  // undefined if count >= 128
}
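
If the data starts life in a vector register, a round trip through GP registers could still use that version (hypothetical glue, an assumption rather than part of the answer; needs x86-64 and SSE4.1 for _mm_extract_epi64). The movq/pextrq out and the reassembly back in cost several uops, which could easily eat the advantage over the pure-SSE version:

__m128i mm_bitshift_left_gpr(__m128i x, unsigned count)
{
    __uint128_t v = (unsigned long long)_mm_cvtsi128_si64(x)                          // low qword
                  | ((__uint128_t)(unsigned long long)_mm_extract_epi64(x, 1) << 64); // high qword
    v <<= count;                                 // still undefined for count >= 128
    return _mm_set_epi64x((long long)(v >> 64), (long long)v);
}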
