Expand the lower two 32-bit floats of an xmm register to the whole xmm register


Problem description

What is the most efficient way in Intel x86 assembly to do the following operation (a, b are 32-bit floats):

From xmm1: [-, -, a, b] to xmm1: [a, a, b, b]

I could not find any useful instructions.
My idea is to copy a and b to other registers, then shift the xmm1 register by 4 bytes and move a or b into the lowest 4 bytes.

Solution

You're looking for unpcklps xmm1, xmm1 (https://www.felixcloutier.com/x86/unpcklps) to interleave the low elements from a register with itself:
the lowest element goes to the bottom 2 positions, and the 2nd-lowest goes to the high 2.
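
For readers working in C rather than asm, here is a minimal sketch of the same interleave using the _mm_unpacklo_ps intrinsic, which corresponds to unpcklps. The function names dup_low_pair and demo_dup_low_pair and the sample values are made up for illustration, and the compiler is free to emit a different but equivalent instruction.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_unpacklo_ps */

/* Interleave the low two floats of v with themselves.
 * In low-to-high (memory) order: {e0, e1, e2, e3} -> {e0, e0, e1, e1}. */
static inline __m128 dup_low_pair(__m128 v)
{
    return _mm_unpacklo_ps(v, v);
}

/* Illustrative check with made-up values: b = 1.0f (lowest), a = 2.0f. */
void demo_dup_low_pair(void)
{
    float out[4];
    /* _mm_set_ps takes elements high-to-low, like the [-, -, a, b] notation. */
    __m128 v = _mm_set_ps(9.0f, 8.0f, 2.0f, 1.0f);
    _mm_storeu_ps(out, dup_low_pair(v));
    assert(out[0] == 1.0f && out[1] == 1.0f);  /* b, b */
    assert(out[2] == 2.0f && out[3] == 2.0f);  /* a, a */
}
```

Written high-to-low, the result is [a, a, b, b], which is the layout asked for in the question.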

You could use shufps instead, but that wouldn't be any better in this case, and it would need an immediate byte. To copy-and-shuffle in one instruction, you could use pshufd, but on a few CPUs that integer instruction is slower between FP instructions. It's still typically better than movaps + unpcklps: either there's no bypass latency, or it's 1 cycle, in which case movaps would cost the same latency but also some throughput resources. The exception is Nehalem, where the bypass latency would be 2 cycles. I don't think any CPUs with mov-elimination have bypass latency for shuffles, but maybe some AMD CPUs do.
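
As a rough illustration of the copy-and-shuffle idea, the sketch below expresses the pshufd approach with SSE2 intrinsics. The names dup_low_pair_pshufd and demo_dup_low_pair_pshufd are hypothetical, and the intrinsic only states intent: a compiler may well choose a different shuffle instruction than pshufd.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_shuffle_epi32 and the cast intrinsics */

/* Produce {e0, e0, e1, e1} from src without modifying src. */
static inline __m128 dup_low_pair_pshufd(__m128 src)
{
    /* The casts only reinterpret the bits; they generate no instructions. */
    __m128i bits = _mm_castps_si128(src);
    /* Selectors, high to low: lane 3 <- 1, lane 2 <- 1, lane 1 <- 0, lane 0 <- 0. */
    __m128i shuf = _mm_shuffle_epi32(bits, _MM_SHUFFLE(1, 1, 0, 0));
    return _mm_castsi128_ps(shuf);
}

/* Illustrative check with made-up values. */
void demo_dup_low_pair_pshufd(void)
{
    float out[4], kept[4];
    __m128 src = _mm_set_ps(9.0f, 8.0f, 2.0f, 1.0f);
    __m128 dst = dup_low_pair_pshufd(src);
    _mm_storeu_ps(out, dst);
    _mm_storeu_ps(kept, src);
    assert(out[0] == 1.0f && out[1] == 1.0f);
    assert(out[2] == 2.0f && out[3] == 2.0f);
    /* The source value is still intact, high elements included. */
    assert(kept[2] == 8.0f && kept[3] == 9.0f);
}
```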


If you were having trouble finding the right shuffle instruction, consider writing it in C and seeing if clang can turn it into a shuffle for you, for example with _mm_set_ps(v[1], v[1], v[0], v[0]). In general that won't always compile to good asm, but it's worth a try with clang -O3 (clang has a very good shuffle optimizer). In this case both GCC and clang figure out how to do that with one unpcklps xmm0,xmm0 (https://godbolt.org/z/o6PTeP) instead of the disaster that was possible. The reverse order compiles to shufps xmm0,xmm0, 5 (5 is 0b00'00'01'01).

(Note that indexing a __m128 as v[idx] is a GNU extension, but I'm only suggesting doing it with clang to find a good shuffle. If you ultimately want intrinsics, check clang's asm then use the intrinsic for that in your code, not a _mm_set)
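
A sketch of what such a probe might look like, suitable for pasting into a compiler explorer. It relies on the GNU vector-subscript extension mentioned above, so it assumes GCC or clang; the function names are invented for this example, and the instruction the compiler actually picks should be checked in its asm output rather than assumed.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_set_ps */

/* v[i] on a __m128 is a GNU C extension (GCC and clang only).
 * _mm_set_ps takes its arguments high-to-low. */
__m128 dup_low_pair_set(__m128 v)
{
    return _mm_set_ps(v[1], v[1], v[0], v[0]);  /* low-to-high: {v0, v0, v1, v1} */
}

/* The reversed order, low-to-high: {v1, v1, v0, v0}. */
__m128 dup_low_pair_reversed(__m128 v)
{
    return _mm_set_ps(v[0], v[0], v[1], v[1]);
}

/* Illustrative check with made-up values. */
void demo_dup_low_pair_set(void)
{
    float fwd[4], rev[4];
    __m128 v = _mm_set_ps(9.0f, 8.0f, 2.0f, 1.0f);
    _mm_storeu_ps(fwd, dup_low_pair_set(v));
    _mm_storeu_ps(rev, dup_low_pair_reversed(v));
    assert(fwd[0] == 1.0f && fwd[1] == 1.0f && fwd[2] == 2.0f && fwd[3] == 2.0f);
    assert(rev[0] == 2.0f && rev[1] == 2.0f && rev[2] == 1.0f && rev[3] == 1.0f);
}
```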

Also see the SIMD chapter in Agner Fog's optimization guide (https://agner.org/optimize/); he has a good table of instructions to consider for different kinds of data movement. Also https://www.officedaytime.com/simd512e/simd.html has a good visual quick-reference, and https://software.intel.com/sites/landingpage/IntrinsicsGuide/ lets you filter by category (Swizzle = shuffles), and by ISA level (so you can exclude AVX512 which has a bazillion versions of every intrinsic with masking.)

See also https://stackoverflow.com/tags/sse/info for these links and more.


If you don't know the available instructions well (and the CPU-architecture / performance tuning details), you're probably better off using C with intrinsics. The compiler can find better ways when you come up with a less efficient way to do a shuffle. e.g. compilers would hopefully optimize _mm_shuffle_ps(v,v, _MM_SHUFFLE(1,1,0,0)) into unpcklps for you.
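
To make the equivalence concrete, the following sketch checks that the _mm_shuffle_ps form and _mm_unpacklo_ps produce the same lanes for one sample input. The helper names and test values are assumptions for illustration; whether a given compiler really rewrites one into the other has to be confirmed from its generated asm.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: _mm_shuffle_ps, _mm_unpacklo_ps, _mm_cmpeq_ps */

/* The "less efficient" spelling: a general shuffle with an immediate. */
static inline __m128 dup_low_pair_shufps(__m128 v)
{
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 0, 0));
}

/* Returns 1 if all four lanes of x and y compare equal (no NaNs expected here). */
static inline int all_lanes_equal(__m128 x, __m128 y)
{
    return _mm_movemask_ps(_mm_cmpeq_ps(x, y)) == 0xF;
}

/* Illustrative check with made-up values. */
void demo_shufps_matches_unpcklps(void)
{
    __m128 v = _mm_set_ps(9.0f, 8.0f, 2.0f, 1.0f);
    /* _MM_SHUFFLE packs four 2-bit selectors, high lane first: 01'01'00'00. */
    assert(_MM_SHUFFLE(1, 1, 0, 0) == 0x50);
    assert(all_lanes_equal(dup_low_pair_shufps(v), _mm_unpacklo_ps(v, v)));
}
```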

It's very rare that hand-written asm is the right choice, especially for x86. Compilers generally do a good job with intrinsics, especially GCC and clang. If you didn't know that unpcklps existed, you're probably a long way from being able to beat the compiler easily / routinely.
