Shifting 4 integers right by different values SIMD


Problem Description



SSE does not provide a way to shift packed integers by a different amount per element (I can use any instructions, AVX and older). You can only do uniform shifts. The result I'm trying to achieve for each integer in the vector is this:

i[0] = i[0] & 0b111111;
i[1] = (i[1]>>6) & 0b111111;
i[2] = (i[2]>>12) & 0b111111;
i[3] = (i[3]>>18) & 0b111111;

Essentially trying to isolate a different group of 6 bits in each integer.

So what is the optimal solution?

Things I thought about: You can simulate a variable right shift with a variable left shift and a uniform right shift. I thought about multiplying the packed integers each by a different amount (therefore simulating a left shift). Then, with the result, you can do a uniform right shift to get the answer. The problem with this is that the specific op I would use for the multiplication is _mm_mullo_epi32, which has disappointing latency (10 cycles on Haswell), and given my program it would have to wait for the result, because this particular result is a dependency for the next instructions. Overall, I think that method would only be a little faster than the brute-force method, which is to unpack, shift using scalar instructions, and then repack the vector, which I think would take about 20 cycles.

Solution

What kind of dependency chain is this part of? Can you unroll and interleave so two vectors are in flight at once? Two long dep chains in parallel are much better than one long dep chain, if it's so long that the out-of-order window can't see the next dep chain in the next loop iteration.
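If the surrounding code allows it, that interleaving could look something like the sketch below. This is only an illustration under assumptions not in the original (the loop, the src/dst pointers, and the element count are hypothetical); it runs the isolate_successive_6bits function defined later in this answer on two independent vectors per iteration so their dep chains can overlap.

#include <immintrin.h>
#include <stddef.h>

__m128i isolate_successive_6bits(__m128i input);   // defined further down

// Hypothetical unrolled loop: two independent vectors in flight per iteration.
// Assumes n is a multiple of 8 and the pointers don't alias.
void isolate_many(const int *src, int *dst, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m128i a = _mm_loadu_si128((const __m128i*)(src + i));
        __m128i b = _mm_loadu_si128((const __m128i*)(src + i + 4));
        a = isolate_successive_6bits(a);   // dep chain 1
        b = isolate_successive_6bits(b);   // dep chain 2, independent of chain 1
        _mm_storeu_si128((__m128i*)(dst + i),     a);
        _mm_storeu_si128((__m128i*)(dst + i + 4), b);
    }
}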


It could be worth making a separate AVX2 version of your function for use on Haswell and later CPUs (where you can use a variable-shift). If you do that, your function will only use pmulld (mullo_epi32) on CPUs where it's most efficient.
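For reference, that AVX2 path could look something like the sketch below (not code from the original answer; it assumes the same 0/6/12/18 shift counts and uses _mm_srlv_epi32, the intrinsic for vpsrlvd):

__m128i isolate_successive_6bits_avx2(__m128i input)
{   // AVX2: vpsrlvd shifts each 32-bit element right by its own count.
    __m128i counts  = _mm_set_epi32(18, 12, 6, 0);   // element 3 shifts by 18, element 0 by 0
    __m128i shifted = _mm_srlv_epi32(input, counts);
    return _mm_and_si128(shifted, _mm_set1_epi32(0b111111));   // keep only the low 6 bits
}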

pmulld looks ideal for throughput and fused-domain uop count even on Haswell. On SnB/IvB where it's a single uop for the vector-integer multiply unit, the whole function is only 2 uops / 6 cycle latency / one per 1c throughput. (Which is worse than I managed with shift/blend, so you'd only want to use pmulld if throughput/code-size matters at all, and you aren't bottlenecked purely on latency. e.g. after unrolling.)

// See the godbolt link below for a version of this with more comments
// SnB/IvB: 6c latency, 2 fused-domain uops.
__m128i isolate_successive_6bits_mul (__m128i input)
{
  // We can avoid the AND if we shift the elements all the way to the left to knock off the high garbage bits.
  // 32 - 6 - 18 = 8 extra bits to left shift
    __m128i mul_constant = _mm_set_epi32(1<<(0+8), 1<<(6+8), 1<<(12+8), 1<<(18+8));
    __m128i left_vshift = _mm_mullo_epi32(input, mul_constant);
    __m128i rightshifted = _mm_srli_epi32(left_vshift, (18+8));
    return rightshifted;
}


A smart way with blends:

Shift and blend so each element ends up with the right total shift count.

AND-mask the low 6 bits after combining everything into one vector.

Same latency as the brute-force way (see below) on Intel CPUs, and better throughput (because of fewer uops). Only two immediate-blends tying up port5 is nice. (AVX2 vpblendd can run on any port, but then we'd just use vpsrlvd.)

// seems to be optimal for Intel CPUs.
__m128i isolate_successive_6bits (__m128i input)
{ // input =   [ D      C      B     A ]
  // output =  [ D>>18  C>>12  B>>6  A ] & set1(0b111111)
    __m128i right12 = _mm_srli_epi32(input, 12);
    __m128i merged = _mm_blend_epi16(input, right12, 0xF0);  // copy upper half, like `movhps` (but don't use that because of extra bypass delay)
    // merged = [ D>>12  C>>12  B>>0  A>>0 ]
    __m128i right6 = _mm_srli_epi32(merged, 6);
    merged = _mm_blend_epi16(merged, right6, 0b11001100);    // blend in the odd elements
    // merged = [ D>>(12+6)  C>>12  B>>(0+6)  A>>0 ]        
    return _mm_and_si128(merged, _mm_set1_epi32(0b111111));  // keep only the low 6 bits
}
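As a quick sanity check, this can be compared against the scalar code from the question; the test values and main() below are just an example I made up, assuming isolate_successive_6bits from above is in scope:

#include <stdio.h>
#include <immintrin.h>

int main(void)
{
    int in[4] = { 0x12345678, 0x2ABCDEF0, 0x7FFFFFFF, 0x01FC0000 };   // arbitrary test values
    int expected[4] = {  in[0]        & 0b111111,
                        (in[1] >> 6)  & 0b111111,
                        (in[2] >> 12) & 0b111111,
                        (in[3] >> 18) & 0b111111 };
    int out[4];
    __m128i v = isolate_successive_6bits(_mm_loadu_si128((const __m128i*)in));
    _mm_storeu_si128((__m128i*)out, v);
    for (int i = 0; i < 4; i++)
        printf("%d: expected=%d got=%d %s\n", i, expected[i], out[i],
               expected[i] == out[i] ? "ok" : "MISMATCH");
    return 0;
}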

I put both versions on the Godbolt compiler explorer.

This version is only 5 uops, compiled with gcc 5.3 -O3 -march=ivybridge:

    # input in xmm0, result in xmm0
isolate_successive_6bits:
    vpsrld          xmm1, xmm0, 12                # starts on cycle 0, result ready for the start of cycle 1
    vpblendw        xmm0, xmm0, xmm1, 240         # cycle 1
    vpsrld          xmm1, xmm0, 6                 # cycle 2
    vpblendw        xmm0, xmm0, xmm1, 204         # cycle 3
    vpand           xmm0, xmm0, XMMWORD PTR .LC0[rip] # cycle 4, result ready on cycle 5
    ret

Every instruction is dependent on the previous, so it has 5c latency. SnB/IvB/HSW/BDW CPUs only have one shift port, so they can't take advantage of the parallelism available in the more brute-force version (which does three shifts with different shift counts). Skylake can, but then the two cycles of blending afterwards eat up the improvement.


The "brute force" way:

Do three shifts with the three different shift counts, and use three immediate blends (pblendw) to combine the four vectors (the input plus the three shifted copies) into one that has each desired element.

// same latency as the previous version on Skylake
// slower on previous Intel SnB-family CPUs.
isolate_successive_6bits_parallel:
    vpsrld          xmm1, xmm0, 6            # cycle 0.   SKL: c0
    vpsrld          xmm2, xmm0, 12           # cycle 1 (resource conflict on pre-Skylake).  SKL: c0
    vpblendw        xmm1, xmm0, xmm1, 12     # cycle 2 (input dep).  SKL: c1
    vpsrld          xmm3, xmm0, 18           # cycle 2.  SKL: c1
    vpblendw        xmm0, xmm2, xmm3, 192    # cycle 3 (input dep). SKL: c2
    vpblendw        xmm0, xmm1, xmm0, 240    # cycle 4 (input dep). SKL: c3
    vpand           xmm0, xmm0, XMMWORD PTR .LC0[rip]  # cycle 5 (input dep). SKL: c4.
    ret
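Written with intrinsics, that brute-force version could look something like this (my reconstruction to match the asm above, not code from the original answer; the blend immediates correspond to 12, 192, and 240):

__m128i isolate_successive_6bits_parallel(__m128i input)
{   // input = [ D  C  B  A ]
    __m128i r6  = _mm_srli_epi32(input, 6);
    __m128i r12 = _mm_srli_epi32(input, 12);
    __m128i r18 = _mm_srli_epi32(input, 18);
    __m128i lo  = _mm_blend_epi16(input, r6,  0b00001100);   // [ D      C      B>>6   A     ]
    __m128i hi  = _mm_blend_epi16(r12,   r18, 0b11000000);   // [ D>>18  C>>12  B>>12  A>>12 ]
    __m128i merged = _mm_blend_epi16(lo, hi,  0b11110000);   // [ D>>18  C>>12  B>>6   A     ]
    return _mm_and_si128(merged, _mm_set1_epi32(0b111111));  // keep only the low 6 bits
}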

Doing the merging with a linear dependency chain instead of a tree means merging can finish sooner after the last shift result is ready:

isolate_successive_6bits_parallel2:
    vpsrld          xmm1, xmm0, 6          # c0.  SKL:c0
    vpsrld          xmm2, xmm0, 12         # c1.  SKL:c0
    vpblendw        xmm1, xmm0, xmm1, 12   # c2.  SKL:c1
    vpblendw        xmm1, xmm1, xmm2, 48   # c3.  SKL:c2
    vpsrld          xmm0, xmm0, 18         # c2.  SKL:c1
    vpblendw        xmm0, xmm1, xmm0, 192  # c4.  SKL:c3 (dep on xmm1)
    vpand           xmm0, xmm0, XMMWORD PTR .LC0[rip] # c5.  SKL:c4
    ret

Hmm, nope, doesn't help. No gain in latency for SnB to BDW, or for SKL. The first merge can happen after only one shift because the un-shifted input is what we need for one element. If element 0 needed a non-zero shift count, this way would have an advantage for pre-SKL, and maybe a disadvantage for SKL.
