如何从AVX寄存器中获取数据? [英] How to get data out of AVX registers?

查看:462
本文介绍了如何从AVX寄存器中获取数据?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

使用MSVC 2013和AVX 1,我在寄存器中有8个浮点数:

Using MSVC 2013 and AVX 1, I've got 8 floats in a register:

__m256 foo = mm256_fmadd_ps(a,b,c);

现在我要为所有8个浮点数调用inline void print(float) {...}.看来 Intel AVX内部特性会使这变得相当复杂:

Now I want to call inline void print(float) {...} for all 8 floats. It looks like the Intel AVX intrisics would make this rather complicated:

print(_castu32_f32(_mm256_extract_epi32(foo, 0)));
print(_castu32_f32(_mm256_extract_epi32(foo, 1)));
print(_castu32_f32(_mm256_extract_epi32(foo, 2)));
// ...

但是MSVC甚至没有这两个内在函数中的任何一个.当然,我可以将这些值写回内存并从那里加载,但是我怀疑在汇编级别不需要溢出寄存器.

but MSVC doesn't even have either of these two intrinsics. Sure, I could write back the values to memory and load from there, but I suspect that at assembly level there's no need to spill a register.

奖金Q:我当然想写

for(int i = 0; i !=8; ++i) 
    print(_castu32_f32(_mm256_extract_epi32(foo, i)))

但是MSVC无法理解许多内在的 require 循环展开过程.如何在__m256 foo中的8x32浮点数上编写循环?

but MSVC doesn't understand that many intrinsics require loop unrolling. How do I write a loop over the 8x32 floats in __m256 foo?

推荐答案

注意:_mm256_fmadd_ps不是AVX1的一部分. FMA3有其自己的功能位,并且仅在Haswell的Intel上引入. AMD推出了带有打桩机的FMA3(AVX1 + FMA4 + FMA3,没有AVX2).

Careful: _mm256_fmadd_ps isn't part of AVX1. FMA3 has its own feature bit, and was only introduced on Intel with Haswell. AMD introduced FMA3 with Piledriver (AVX1+FMA4+FMA3, no AVX2).

在asm级别上,如果要将八个32位元素放入整数寄存器,则实际上存储到堆栈中然后进行标量加载会更快. pextrd是有关SnB系列和Bulldozer系列的2 uop指令. (还有不支持AVX的Nehalem和Silvermont).

At the asm level, if you want to get eight 32bit elements into integer registers, it is actually faster to store to the stack and then do scalar loads. pextrd is a 2-uop instruction on SnB-family, and Bulldozer-family. (and Nehalem and Silvermont, which don't support AVX).

vextractf128 + 2x movd + 6x pextrd并不可怕的唯一CPU是AMD Jaguar. (便宜的pextrd,只有一个装载端口.)(请参阅 Agner Fog的insn表)

The only CPU where vextractf128 + 2xmovd + 6xpextrd isn't terrible is AMD Jaguar. (cheap pextrd, and only one load port.) (See Agner Fog's insn tables)

一个大范围对齐的存储可以转发重叠的狭窄负载. (当然,您可以使用movd来获取低位元素,因此可以同时使用负载端口和ALU端口.)

A wide aligned store can forward to overlapping narrow loads. (Of course, you can use movd to get the low element, so you have a mix of load port and ALU port uops).

当然,您似乎正在通过使用整数提取来提取float,然后将其转换回浮点数.这似乎很可怕.

Of course, you seem to be extracting floats by using an integer extract and then converting it back to a float. That seems horrible.

实际需要的是每个float在其自己的xmm寄存器的low元素中. vextractf128显然是开始的方式,将元素4置于新的xmm reg的底部.然后6x AVX shufps可以轻松获得每一半的其他三个元素. (或者movshdupmovhlps的编码较短:没有立即字节).

What you actually need is each float in the low element of its own xmm register. vextractf128 is obviously the way to start, bringing element 4 to the bottom of a new xmm reg. Then 6x AVX shufps can easily get the other three elements of each half. (Or movshdup and movhlps have shorter encodings: no immediate byte).

与1个存储区和7个加载uos相比,值得考虑使用7个shuffle uops,但是如果您还是要为函数调用散布向量,则不值得.

7 shuffle uops are worth considering vs. 1 store and 7 load uops, but not if you were going to spill the vector for a function call anyway.

您在Windows上,其中xmm6-15保留了呼叫(仅low128; ymm6-15的上半部分被保留了电话).这是从vextractf128开始的另一个原因.

You're on Windows, where xmm6-15 are call-preserved (only the low128; the upper halves of ymm6-15 are call-clobbered). This is yet another reason to start with vextractf128.

在SysV ABI中,所有xmm/ymm/zmm寄存器都被调用,因此每个print()函数都需要溢出/重载.唯一明智的做法是存储到内存中,并使用原始向量调用print(即打印低位元素,因为它将忽略寄存器的其余部分).然后movss xmm0, [rsp+4]并在第二个元素上调用print,依此类推.

In the SysV ABI, all the xmm / ymm / zmm registers are call-clobbered, so every print() function requires a spill/reload. The only sane thing to do there is store to memory and call print with the original vector (i.e. print the low element, because it will ignore the rest of the register). Then movss xmm0, [rsp+4] and call print on the 2nd element, etc.

将所有8个浮点数很好地解压到8个向量reg中对您没有好处,因为无论如何在第一个函数调用之前它们都必须分别溢出!

It does you no good to get all 8 floats nicely unpacked into 8 vector regs, because they'd all have to be spilled separately anyway before the first function call!

这篇关于如何从AVX寄存器中获取数据?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆