SSE/AVX equivalent for NEON vuzp


Question


Intel's vector extensions SSE, AVX, etc. provide two unpack operations for each element size, e.g. the SSE intrinsics are _mm_unpacklo_* and _mm_unpackhi_*. For 4 elements in a vector, they do this:

inputs:      (A0 A1 A2 A3) (B0 B1 B2 B3)
unpacklo/hi: (A0 B0 A1 B1) (A2 B2 A3 B3)

The equivalent of unpack is vzip in ARM's NEON instruction set. However, the NEON instruction set also provides the operation vuzp which is the inverse of vzip. For 4 elements in a vector, it does this:

inputs: (A0 A1 A2 A3) (B0 B1 B2 B3)
vuzp:   (A0 A2 B0 B2) (A1 A3 B1 B3)

How can vuzp be implemented efficiently using SSE or AVX intrinsics? There doesn't seem to be an instruction for it. For 4 elements, I assume it can be done using a shuffle and a subsequent unpack moving 2 elements:

inputs:        (A0 A1 A2 A3) (B0 B1 B2 B3)
shuffle:       (A0 A2 A1 A3) (B0 B2 B1 B3)
unpacklo/hi 2: (A0 A2 B0 B2) (A1 A3 B1 B3)
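
For concreteness, a minimal SSE2 sketch of that idea for 32-bit elements (the helper names are made up): pshufd gathers the even and odd elements within each input, then a 64-bit unpack combines the halves of the two vectors.

#include <immintrin.h>

static inline __m128i uzp_lo_sse2(__m128i a, __m128i b) {
    __m128i a_sh = _mm_shuffle_epi32(a, _MM_SHUFFLE(3,1,2,0)); // (A0 A2 A1 A3)
    __m128i b_sh = _mm_shuffle_epi32(b, _MM_SHUFFLE(3,1,2,0)); // (B0 B2 B1 B3)
    return _mm_unpacklo_epi64(a_sh, b_sh);                     // (A0 A2 B0 B2)
}

static inline __m128i uzp_hi_sse2(__m128i a, __m128i b) {
    __m128i a_sh = _mm_shuffle_epi32(a, _MM_SHUFFLE(3,1,2,0));
    __m128i b_sh = _mm_shuffle_epi32(b, _MM_SHUFFLE(3,1,2,0));
    return _mm_unpackhi_epi64(a_sh, b_sh);                     // (A1 A3 B1 B3)
}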

Is there a more efficient solution using a single instruction? (Maybe for SSE first - I'm aware that for AVX we may have the additional problem that shuffle and unpack don't cross lanes.)

Knowing this may be useful for writing code for data swizzling and deswizzling (it should be possible to derive deswizzling code just by inverting the operations of swizzling code based on unpack operations).

Edit: Here is the 8-element version. This is the effect of NEON's vuzp:

input:         (A0 A1 A2 A3 A4 A5 A6 A7) (B0 B1 B2 B3 B4 B5 B6 B7)
vuzp:          (A0 A2 A4 A6 B0 B2 B4 B6) (A1 A3 A5 A7 B1 B3 B5 B7)

This is my version with one shuffle and one unpack for each output vector (it seems to generalize to larger numbers of elements):

input:         (A0 A1 A2 A3 A4 A5 A6 A7) (B0 B1 B2 B3 B4 B5 B6 B7)
shuffle:       (A0 A2 A4 A6 A1 A3 A5 A7) (B0 B2 B4 B6 B1 B3 B5 B7)
unpacklo/hi 4: (A0 A2 A4 A6 B0 B2 B4 B6) (A1 A3 A5 A7 B1 B3 B5 B7)
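
A sketch of this 8 x 16-bit version might look like the following, assuming SSSE3 for the in-vector gather via pshufb (the control vector and helper name are illustrative only):

#include <immintrin.h>

static inline void uzp16_shuffle_unpack(__m128i &a, __m128i &b) {
    // Gather even words into the low half and odd words into the high half.
    const __m128i even_then_odd = _mm_setr_epi8(0,1, 4,5, 8,9, 12,13,
                                                2,3, 6,7, 10,11, 14,15);
    __m128i a_sh = _mm_shuffle_epi8(a, even_then_odd); // (A0 A2 A4 A6 A1 A3 A5 A7)
    __m128i b_sh = _mm_shuffle_epi8(b, even_then_odd); // (B0 B2 B4 B6 B1 B3 B5 B7)
    a = _mm_unpacklo_epi64(a_sh, b_sh);                // (A0 A2 A4 A6 B0 B2 B4 B6)
    b = _mm_unpackhi_epi64(a_sh, b_sh);                // (A1 A3 A5 A7 B1 B3 B5 B7)
}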

The method suggested by EOF is correct but would require log2(8)=3 unpack operations for each output:

input:         (A0 A1 A2 A3 A4 A5 A6 A7) (B0 B1 B2 B3 B4 B5 B6 B7)
unpacklo/hi 1: (A0 B0 A1 B1 A2 B2 A3 B3) (A4 B4 A5 B5 A6 B6 A7 B7)
unpacklo/hi 1: (A0 A4 B0 B4 A1 A5 B1 B5) (A2 A6 B2 B6 A3 A7 B3 B7)
unpacklo/hi 1: (A0 A2 A4 A6 B0 B2 B4 B6) (A1 A3 A5 A7 B1 B3 B5 B7)
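
As a sketch (SSE2 only, 16-bit elements), the repeated-unpack method can be written as a short loop; after log2(8) = 3 rounds the two vectors hold the unzipped results:

#include <immintrin.h>

static inline void uzp16_by_unpack(__m128i &a, __m128i &b) {
    for (int round = 0; round < 3; ++round) {   // log2(8) rounds
        __m128i lo = _mm_unpacklo_epi16(a, b);
        __m128i hi = _mm_unpackhi_epi16(a, b);
        a = lo;                                 // ends as (A0 A2 A4 A6 B0 B2 B4 B6)
        b = hi;                                 // ends as (A1 A3 A5 A7 B1 B3 B5 B7)
    }
}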

Solution

it should be possible to derive deswizzling code just by inverting the operations

Get used to being disappointed and frustrated by the non-orthogonality of Intel's vector shuffles. There is no direct inverse for punpck. The SSE/AVX pack instructions are for narrowing the element size. (So one packusdw is the inverse of punpck[lh]wd against zero, but not when used with two arbitrary vectors). Also, pack instructions are only available for 32->16 (dword to word) and 16->8 (word to byte) element size. There is no packusqd (64->32).
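
To make that concrete, here is a round trip that holds for arbitrary 16-bit data, assuming SSE4.1 for packusdw (_mm_packus_epi32); zero-extended values never hit the unsigned saturation limit:

#include <immintrin.h>

static inline __m128i widen_then_repack(__m128i x) {
    __m128i zero = _mm_setzero_si128();
    __m128i lo = _mm_unpacklo_epi16(x, zero);  // words 0..3, zero-extended to dwords
    __m128i hi = _mm_unpackhi_epi16(x, zero);  // words 4..7, zero-extended to dwords
    return _mm_packus_epi32(lo, hi);           // == x
}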

PACK instructions are only available with saturation, not truncation (until AVX512 vpmovqd), so for this use-case we'd need to prepare 4 different input vectors for 2 PACK instructions. This turns out to be horrible, much worse than your 3-shuffle solution (see unzip32_pack() in the Godbolt link below).


There is a 2-input shuffle that will do what you want for 32-bit elements, though: shufps. The low 2 elements of the result can be any 2 elements of the first vector, and the high 2 elements can be any 2 elements of the second vector. The shuffle we want fits those constraints, so we can use it.

We can solve the whole problem in 2 instructions (plus a movdqa for the non-AVX version, because shufps destroys the left input register):

inputs: a=(A0 A1 A2 A3) b=(B0 B1 B2 B3)
_mm_shuffle_ps(a,b,_MM_SHUFFLE(2,0,2,0)); // (A0 A2 B0 B2)
_mm_shuffle_ps(a,b,_MM_SHUFFLE(3,1,3,1)); // (A1 A3 B1 B3)

_MM_SHUFFLE() uses most-significant-element first notation, like all of Intel's documentation. Your notation is opposite.
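
For reference, the two selectors used here expand to the immediates that show up in the assembly listing further down:

// _MM_SHUFFLE(d,c,b,a) == (d<<6) | (c<<4) | (b<<2) | a
// _MM_SHUFFLE(2,0,2,0) == 0b10001000 == 136   (take the even elements)
// _MM_SHUFFLE(3,1,3,1) == 0b11011101 == 221   (take the odd elements)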

The only intrinsic for shufps uses __m128 / __m256 vectors (float not integer), so you have to cast to use it. _mm_castsi128_ps is a reinterpret_cast: it compiles to zero instructions.

#include <immintrin.h>
static inline
__m128i unziplo(__m128i a, __m128i b) {
    __m128 aps = _mm_castsi128_ps(a);
    __m128 bps = _mm_castsi128_ps(b);
    __m128 lo = _mm_shuffle_ps(aps, bps, _MM_SHUFFLE(2,0,2,0));
    return _mm_castps_si128(lo);
}

static inline    
__m128i unziphi(__m128i a, __m128i b) {
    __m128 aps = _mm_castsi128_ps(a);
    __m128 bps = _mm_castsi128_ps(b);
    __m128 hi = _mm_shuffle_ps(aps, bps, _MM_SHUFFLE(3,1,3,1));
    return _mm_castps_si128(hi);
}

gcc will inline these to a single instruction each. With the static inline removed, we can see how they'd compile as non-inline functions. I put them on the Godbolt compiler explorer.

unziplo(long long __vector(2), long long __vector(2)):
    shufps  xmm0, xmm1, 136
    ret
unziphi(long long __vector(2), long long __vector(2)):
    shufps  xmm0, xmm1, 221
    ret
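
A hypothetical usage example of the unziplo/unziphi helpers above (values picked arbitrarily):

static inline void demo_unzip32(void) {
    __m128i a    = _mm_setr_epi32(0, 1, 2, 3);  // (A0 A1 A2 A3)
    __m128i b    = _mm_setr_epi32(4, 5, 6, 7);  // (B0 B1 B2 B3)
    __m128i even = unziplo(a, b);               // (0 2 4 6)
    __m128i odd  = unziphi(a, b);               // (1 3 5 7)
    (void)even; (void)odd;                      // results would feed further processing
}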

Using FP shuffles on integer data is fine on recent Intel/AMD CPUs. There is no extra bypass-delay latency (see this answer, which summarizes what Agner Fog's microarch guide says about it). It has extra latency on Intel Nehalem, but may still be the best choice there. FP loads/shuffles won't fault or corrupt integer bit-patterns that represent a NaN; only actual FP math instructions care about that.

Fun fact: on AMD Bulldozer-family CPUs (and Intel Core2), FP shuffles like shufps still run in the ivec domain, so they actually have extra latency when used between FP instructions, but not between integer instructions!


Unlike ARM NEON / ARMv8 SIMD, x86 SSE doesn't have any 2-output-register instructions, and they're rare in x86. (They exist, e.g. mul r64, but always decode to multiple uops on current CPUs).

It's always going to take at least 2 instructions to create 2 vectors of results. It would be ideal if they didn't both need to run on the shuffle port, since recent Intel CPUs have a shuffle throughput of only 1 per clock. Instruction-level parallelism doesn't help much when all your instructions are shuffles.

For throughput, 1 shuffle + 2 non-shuffles could be more efficient than 2 shuffles, and have the same latency. Or even 2 shuffles and 2 blends could be more efficient than 3 shuffles, depending on what the bottleneck is in the surrounding code. But I don't think we can replace 2x shufps with that few instructions.


Without SHUFPS:

Your shuffle + unpacklo/hi is pretty good. It would be 4 shuffles total: 2 pshufd to prepare the inputs, then 2 punpckl/h. This is likely to be worse than any bypass latency, except on Nehalem in cases where latency matters but throughput doesn't.

Any other option would seem to require preparing 4 input vectors, for either a blend or packss. See @Mysticial's answer to _mm_shuffle_ps() equivalent for integer vectors (__m128i)? for the blend option. For two outputs, that would take a total of 4 shuffles to make the inputs, and then 2x pblendw (fast) or vpblendd (even faster).
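
A sketch of that blend option for this 32-bit case, assuming SSE4.1 for pblendw (_mm_blend_epi16); the helper names are illustrative:

#include <immintrin.h>

static inline __m128i unziplo_blend(__m128i a, __m128i b) {
    __m128i a_sh = _mm_shuffle_epi32(a, _MM_SHUFFLE(3,1,2,0)); // (A0 A2 A1 A3)
    __m128i b_sh = _mm_shuffle_epi32(b, _MM_SHUFFLE(2,0,3,1)); // (B1 B3 B0 B2)
    return _mm_blend_epi16(a_sh, b_sh, 0xF0);                  // (A0 A2 B0 B2)
}

static inline __m128i unziphi_blend(__m128i a, __m128i b) {
    __m128i a_sh = _mm_shuffle_epi32(a, _MM_SHUFFLE(2,0,3,1)); // (A1 A3 A0 A2)
    __m128i b_sh = _mm_shuffle_epi32(b, _MM_SHUFFLE(3,1,2,0)); // (B0 B2 B1 B3)
    return _mm_blend_epi16(a_sh, b_sh, 0xF0);                  // (A1 A3 B1 B3)
}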

Using packusdw or packuswb for 16-bit or 8-bit elements would also work. It would take 2x pand instructions to mask off the odd elements of a and b, and 2x psrld to shift the odd elements down to the even positions. That sets you up for 2x pack to create the two output vectors (the unsigned packs are exact here because the prepared inputs are zero-extended; the signed packssdw/packsswb would clamp values with the top bit set). 6 total instructions, plus many movdqa because those all destroy their inputs (unlike pshufd, which is a copy+shuffle).

// don't use this, it's not optimal for any CPU
// 16-bit-element version: pack can't help with the 32-bit case at all,
// since there is no 64->32 pack.  packusdw needs SSE4.1; its unsigned
// saturation never triggers here because the inputs are zero-extended.
void unzip16_pack(__m128i &a, __m128i &b) {
    const __m128i even_mask = _mm_set1_epi32(0x0000FFFF);
    __m128i a_even = _mm_and_si128(a, even_mask);   // even words, zero-extended to dwords
    __m128i a_odd  = _mm_srli_epi32(a, 16);         // odd words shifted down, zero-extended
    __m128i b_even = _mm_and_si128(b, even_mask);
    __m128i b_odd  = _mm_srli_epi32(b, 16);
    __m128i lo = _mm_packus_epi32(a_even, b_even);  // (a0 a2 a4 a6 b0 b2 b4 b6)
    __m128i hi = _mm_packus_epi32(a_odd,  b_odd);   // (a1 a3 a5 a7 b1 b3 b5 b7)
    a = lo;
    b = hi;
}

Nehalem is the only CPU where it might be worth using something other than 2x shufps, because of its high (2c) bypass delay. It has 2 per clock shuffle throughput, and pshufd is a copy+shuffle, so 2x pshufd to prepare copies of a and b would only need one extra movdqa after that to get the punpcklqdq and punpckhqdq results into separate registers. (movdqa isn't free; it has 1c latency and needs a vector execution port on Nehalem. It's only cheaper than a shuffle if you're bottlenecked on shuffle throughput, rather than overall front-end bandwidth (uop throughput) or something.)

I very much recommend just using 2x shufps. It will be good on the average CPU, and not horrible anywhere.


AVX512

AVX512 introduced a lane-crossing pack-with-truncation instruction that narrows a single vector (instead of being a 2-input shuffle). It's the inverse of pmovzx, and can narrow 64b->8b or any other combination, instead of only by a factor of 2.

For this case, __m256i _mm512_cvtepi64_epi32(__m512i a) (vpmovqd) will take the even 32-bit elements from a vector and pack them together (i.e. the low halves of each 64-bit element). It's still not a good building block for this de-interleave on its own, though, since you need something else to get the odd elements into place.

It also comes in signed/unsigned saturation versions. The instructions even have a memory-destination form that the intrinsics expose to let you do a masked-store.
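
For illustration, extracting the even and odd dwords of a single 512-bit vector with vpmovqd could look like this (AVX512F; the helper name is made up, and the odd elements still need a shift first, as noted above):

#include <immintrin.h>

static inline void split_even_odd_512(__m512i v, __m256i &even, __m256i &odd) {
    even = _mm512_cvtepi64_epi32(v);                         // low dword of each qword
    odd  = _mm512_cvtepi64_epi32(_mm512_srli_epi64(v, 32));  // high dword, shifted down
}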

But for this problem, as Mysticial points out, AVX512 provides 2-input lane-crossing shuffles which you can use like shufps to solve the whole problem in just two shuffles: vpermi2d/vpermt2d.
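
A sketch of that with intrinsics (AVX512F, 16 x 32-bit elements per vector; the index constants and helper name are illustrative):

#include <immintrin.h>

static inline void uzp32_avx512(__m512i &a, __m512i &b) {
    const __m512i even_idx = _mm512_setr_epi32(0, 2, 4, 6, 8, 10, 12, 14,
                                               16, 18, 20, 22, 24, 26, 28, 30);
    const __m512i odd_idx  = _mm512_setr_epi32(1, 3, 5, 7, 9, 11, 13, 15,
                                               17, 19, 21, 23, 25, 27, 29, 31);
    __m512i lo = _mm512_permutex2var_epi32(a, even_idx, b); // even elements of a, then of b
    __m512i hi = _mm512_permutex2var_epi32(a, odd_idx,  b); // odd elements of a, then of b
    a = lo;
    b = hi;
}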
