Best way to shuffle 64-bit portions of two __m128i's


Question

I have two __m128is, a and b, that I want to shuffle so that the upper 64 bits of a fall in the lower 64 bits of dst and the lower 64 bits of b fall in the upper 64 of dst. i.e.

dst[ 0:63]  = a[64:127]
dst[64:127] = b[0:63]

This is equivalent to either:

__m128i dst = _mm_unpacklo_epi64(_mm_srli_si128(a, 8), b);

or:

__m128i dst = _mm_castpd_si128(_mm_shuffle_pd(_mm_castsi128_pd(a), _mm_castsi128_pd(b), 1));

Is there a better way to do this than the first method? The second one is just one instruction, but the switch to the floating point SIMD execution is more costly than the extra instruction from the first.

Recommended answer

Latency isn't always the worst thing ever. If it's not part of a loop-carried dep-chain, then just use the single instruction.

Also, there might not be any! Agner Fog's microarch doc says he found no extra latency in some cases when using the "wrong" type of shuffle or boolean, on Sandybridge. Blends still have the extra latency. On Haswell, he says there are no extra delays at all for mixing types of shuffle. (pg 140, Data Bypass Delays.)

So go ahead and use shufps, unless you care a lot about your code being fast on Nehalem. (Earlier designs, Merom/Conroe and Penryn, didn't have extra bypass delays for using the wrong move or shuffle.)
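
As a minimal sketch of the options discussed so far, the following C functions express the same shuffle three ways: the two-instruction integer sequence from the question, the shufpd form from the question, and a shufps form. The function names (merge_int, merge_shufpd, merge_shufps, equal128) are hypothetical and chosen only for illustration. Only SSE2 intrinsics are assumed.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in the SSE (xmmintrin.h) ones */

/* Integer-domain version: shift a right by 8 bytes, then interleave low qwords. */
static inline __m128i merge_int(__m128i a, __m128i b) {
    return _mm_unpacklo_epi64(_mm_srli_si128(a, 8), b);
}

/* shufpd: imm bit 0 = 1 picks a's high qword, imm bit 1 = 0 picks b's low qword. */
static inline __m128i merge_shufpd(__m128i a, __m128i b) {
    return _mm_castpd_si128(
        _mm_shuffle_pd(_mm_castsi128_pd(a), _mm_castsi128_pd(b), 1));
}

/* shufps: dwords 2,3 of a fill the low half, dwords 0,1 of b fill the high half. */
static inline __m128i merge_shufps(__m128i a, __m128i b) {
    return _mm_castps_si128(
        _mm_shuffle_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b),
                       _MM_SHUFFLE(1, 0, 3, 2)));
}

/* Helper for checking results: returns 1 when all 16 bytes of x and y match. */
static inline int equal128(__m128i x, __m128i y) {
    return _mm_movemask_epi8(_mm_cmpeq_epi8(x, y)) == 0xFFFF;
}
```

The casts are free at the instruction level; they only change the type the compiler sees. All three functions should produce identical bits, so the choice between them is purely about instruction count and bypass latency on the target CPU.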

For AMD, shufps runs in the ivec domain, same as integer shuffles, so it's fine to use it. Like Intel, FP blends run in the FP domain, and thus have no bypass delay for FP data.

If you include multiple asm versions depending on which instruction sets are supported, without going completely nuts about having the optimal version of everything for every CPU like x264 does, you might use wrong-type ops in your version for AVX CPUs, but use multiple instructions in your non-AVX version. Nehalem has large penalties (2 cycle bypass delays for each domain transition), while Sandybridge is 0 or 1 cycle. SnB is the first generation with AVX.
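
One way to sketch that split in C is a compile-time switch on the compiler-defined __AVX__ macro. This is an illustration under stated assumptions, not what x264 does: the function name merge_hi_lo is hypothetical, and the selection happens when the file is compiled, so a single binary that must run on both kinds of CPU would still need runtime dispatch, which is not shown here.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in the SSE (xmmintrin.h) ones */

/* dst[0:63] = a[64:127], dst[64:127] = b[0:63] */
static inline __m128i merge_hi_lo(__m128i a, __m128i b) {
#if defined(__AVX__)
    /* Built for AVX-capable CPUs (Sandybridge and later): a single FP-type
       shuffle, accepting the 0 or 1 cycle bypass delay mentioned above. */
    return _mm_castps_si128(
        _mm_shuffle_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b),
                       _MM_SHUFFLE(1, 0, 3, 2)));
#else
    /* Built without AVX (may run on Nehalem): stay in the integer domain
       at the cost of one extra instruction. */
    return _mm_unpacklo_epi64(_mm_srli_si128(a, 8), b);
#endif
}
```

Both branches compute the same result, so callers do not need to know which one was compiled in.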

Pre-Nehalem (no SSE4.2) is so old that it's probably not worth tuning a version specifically for it, even though it doesn't have any penalties for "wrong type" shuffles. Nehalem is right on the cusp of being kinda slow, so software running on those systems will have the hardest time operating in real-time, or not feeling slow. Thus, being bad on Nehalem would add to a bad user experience since their system is already not the fastest.
