The best way to shift a __m128i?
Problem description
I need to shift a __m128i variable, (say v), by m bits, in such a way that bits move through all of the variable (So, the resulting variable represents v*2^m). What is the best way to do this?!
Note that _mm_slli_epi64 shifts v0 and v1 separately:
r0 := v0 << count
r1 := v1 << count
so the high bits of v0 are lost, but I want to move those bits into r1.
Edit: I'm looking for code faster than this (for m < 64):
r0 = v0 << m;
r1 = v0 >> (64-m);
r1 ^= v1 << m;
r2 = v1 >> (64-m);
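Spelled out as a complete function on plain 64-bit halves (the struct and names here are illustrative, not from the question; m must stay in 1..63 so that the 64-m shift count is defined):

```c
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;  /* illustrative: lo = v0, hi = v1 */

/* 128-bit left shift by m, valid for 0 < m < 64 only
 * (m = 0 would make the 64-m count equal 64, which is UB for a 64-bit shift). */
static u128 shift128_left(u128 v, unsigned m)
{
    u128 r;
    r.lo = v.lo << m;                         /* r0 = v0 << m */
    r.hi = (v.hi << m) | (v.lo >> (64 - m));  /* r1 = (v1 << m) ^ (v0 >> (64-m)) */
    return r;
}
```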
For compile-time constant shift counts, you can get fairly good results. Otherwise not really.
This is just an SSE implementation of the r0 / r1 code from your question, since there's no other obvious way to do it. Variable-count shifts are only available for bit-shifts within vector elements, not for byte-shifts of the whole register. So we just carry the low 64 bits up to the high 64 and use a variable-count shift to put them in the right place.
// untested
#include <immintrin.h>
/* some compilers might choke on slli / srli with non-compile-time-constant args
* gcc generates the xmm, imm8 form with constants,
* and otherwise generates the xmm, xmm form. (With a movd to get the count into an xmm)
*/
// doesn't optimize for the special-case where count%8 = 0
// could maybe do that in gcc with if(__builtin_constant_p(count)) { if (count%8 == 0) return ...; }
__m128i mm_bitshift_left(__m128i x, unsigned count)
{
__m128i carry = _mm_bslli_si128(x, 8); // old compilers only have the confusingly named _mm_slli_si128 synonym
if (count >= 64)
return _mm_slli_epi64(carry, count-64); // the non-carry part is all zero, so return early
// else
carry = _mm_srli_epi64(carry, 64-count); // After bslli shifted left by 64b
x = _mm_slli_epi64(x, count);
return _mm_or_si128(x, carry);
}
__m128i mm_bitshift_left_3(__m128i x) { // by a specific constant, to see inlined constant version
return mm_bitshift_left(x, 3);
}
// by a specific constant, to see inlined constant version
__m128i mm_bitshift_left_100(__m128i x) { return mm_bitshift_left(x, 100); }
I thought this was going to be less convenient than it turned out to be. _mm_slli_epi64 works on gcc/clang/icc even when the count is not a compile-time constant (generating a movd from an integer reg to an xmm reg). There is a _mm_sll_epi64(__m128i a, __m128i count) (note the lack of i), but at least these days, the i intrinsic can generate either form of psllq.
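For reference, the non-i intrinsics take the shift count from the low 64 bits of a vector register; a hedged sketch of the same carry trick using them explicitly (function name is mine; valid for 0 < count < 64 only, assuming SSE2):

```c
#include <immintrin.h>
#include <stdint.h>

/* Same carry idea as mm_bitshift_left, but using _mm_sll_epi64 / _mm_srl_epi64,
 * which read the shift count from the low 64 bits of a vector.
 * Valid for 0 < count < 64; the count >= 64 case still needs separate handling. */
static __m128i mm_bitshift_left_var(__m128i x, unsigned count)
{
    __m128i cnt   = _mm_cvtsi32_si128((int)count);         /* movd: count into xmm */
    __m128i rcnt  = _mm_cvtsi32_si128((int)(64 - count));  /* 64-count into xmm    */
    __m128i carry = _mm_bslli_si128(x, 8);                 /* low qword moved up   */
    carry = _mm_srl_epi64(carry, rcnt);  /* bits that cross the 64-bit boundary */
    x     = _mm_sll_epi64(x, cnt);
    return _mm_or_si128(x, carry);
}
```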
The compile-time-constant count versions are fairly efficient, compiling to 4 instructions (or 5 without AVX):
mm_bitshift_left_3(long long __vector(2)):
vpslldq xmm1, xmm0, 8
vpsrlq xmm1, xmm1, 61
vpsllq xmm0, xmm0, 3
vpor xmm0, xmm0, xmm1
ret
This has 3 cycle latency (vpslldq(1) -> vpsrlq(1) -> vpor(1)) on Intel SnB/IvB/Haswell, with throughput limited to one per 2 cycles (saturating the vector shift unit on port 0). Byte-shift runs on the shuffle unit on a different port. Immediate-count vector shifts are all single-uop instructions, so this is only 4 fused-domain uops taking up pipeline space when mixed in with other code. (Variable-count vector shifts are 2 uop, 2 cycle latency, so the variable-count version of this function is worse than it looks from counting instructions.)
Or for counts >= 64:
mm_bitshift_left_100(long long __vector(2)):
vpslldq xmm0, xmm0, 8
vpsllq xmm0, xmm0, 36
ret
If your shift-count is not a compile-time constant, you have to branch on count >= 64 to figure out whether to left- or right-shift the carry. I believe the shift count is interpreted as an unsigned integer, so a negative count is impossible.
It also takes extra instructions to get the int count and 64-count into vector registers. Doing this in a branchless fashion with vector compares and a blend instruction might be possible, but a branch is probably a good idea.
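One branchless option avoids compares and blends entirely: x86 *vector* shifts zero the result for any count above 63 (they saturate, unlike scalar shifts, which mask the count), so you can compute all three candidate shifts and OR them, letting out-of-range counts zero themselves out. An untested sketch under that assumption (note it costs three variable-count shifts, each 2 uops on SnB-family, so the branchy version may still win):

```c
#include <immintrin.h>
#include <stdint.h>

/* Branchless 128-bit left shift for 0 <= count < 128.  Relies on psllq/psrlq
 * producing zero for any count > 63.  The three OR'd terms:
 *   lo_part:  x << count            (nonzero only when count < 64)
 *   mid:      carry >> (64-count)   (bits crossing the 64-bit boundary)
 *   hi_part:  carry << (count-64)   (nonzero only when count >= 64)
 * The "wrong-range" terms see a huge unsigned count and come out zero. */
static __m128i mm_bitshift_left_branchless(__m128i x, unsigned count)
{
    __m128i carry   = _mm_bslli_si128(x, 8);  /* low qword moved to high qword */
    __m128i lo_part = _mm_sll_epi64(x,     _mm_cvtsi32_si128((int)count));
    __m128i mid     = _mm_srl_epi64(carry, _mm_cvtsi32_si128((int)(64 - count)));
    __m128i hi_part = _mm_sll_epi64(carry, _mm_cvtsi32_si128((int)(count - 64)));
    return _mm_or_si128(lo_part, _mm_or_si128(mid, hi_part));
}
```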
The variable-count version for __uint128_t in GP registers looks fairly good; better than the SSE version. Clang does a slightly better job than gcc, emitting fewer mov instructions, but it still uses two cmov instructions for the count >= 64 case. (Because x86 integer shift instructions mask the count, instead of saturating.)
__uint128_t leftshift_int128(__uint128_t x, unsigned count) {
return x << count; // undefined if count >= 128
}
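The select that the compiler resolves with cmov can be written out by hand; an illustrative scalar equivalent (my own formulation, not the exact codegen) of what the __uint128_t shift has to compute:

```c
#include <stdint.h>

/* Scalar equivalent of the __uint128_t left shift, valid for 0 <= count < 128.
 * The count >= 64 case is the select compilers turn into cmov rather than a branch. */
static void shift128(uint64_t lo, uint64_t hi, unsigned count,
                     uint64_t *rlo, uint64_t *rhi)
{
    if (count >= 64) {
        *rhi = lo << (count - 64);
        *rlo = 0;
    } else if (count == 0) {     /* avoid the undefined lo >> 64 */
        *rhi = hi;
        *rlo = lo;
    } else {
        *rhi = (hi << count) | (lo >> (64 - count));
        *rlo = lo << count;
    }
}
```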