How to clear the upper 128 bits of an __m256 value?


Question

How can I clear the upper 128 bits of m2:

__m256i    m2 = _mm256_set1_epi32(2);
__m128i    m1 = _mm_set1_epi32(1);

m2 = _mm256_castsi128_si256(_mm256_castsi256_si128(m2));
m2 = _mm256_castsi128_si256(m1);

don't work -- Intel’s documentation for the _mm256_castsi128_si256 intrinsic says that "the upper bits of the resulting vector are undefined". At the same time I can easily do it in assembly:

VMOVDQA xmm2, xmm2  //zeros upper ymm2
VMOVDQA xmm2, xmm1

Of course I'd not like to use "and" or _mm256_insertf128_si256() and such.

Answer

Update: there's now a __m256i _mm256_zextsi128_si256(__m128i) intrinsic; see Agner Fog's answer. The rest of the answer below is only relevant for old compilers that don't support this intrinsic, and where there's no efficient, portable solution.

Unfortunately, the ideal solution will depend on which compiler you are using, and on some of them, there is no ideal solution.

There are several basic ways that we could write this:

Version A:

ymm = _mm256_set_m128i(_mm_setzero_si128(), _mm256_castsi256_si128(ymm));

Version B:

ymm = _mm256_blend_epi32(_mm256_setzero_si256(),
                         ymm,
                         _MM_SHUFFLE(0, 0, 3, 3));

Version C:

ymm = _mm256_inserti128_si256(_mm256_setzero_si256(),
                              _mm256_castsi256_si128(ymm),
                              0);

Each of these does precisely what we want, clearing the upper 128 bits of a 256-bit YMM register, so any of them can safely be used. But which is optimal? Well, that depends on which compiler you are using...

GCC:

Version A: Not supported at all because GCC lacks the _mm256_set_m128i intrinsic. (Could be simulated, of course, but that would be done using one of the forms in "B" or "C".)

Version B: Compiled to inefficient code. Idiom is not recognized and intrinsics are translated very literally to machine-code instructions. A temporary YMM register is zeroed using VPXOR, and then that is blended with the input YMM register using VPBLENDD.

Version C: Ideal. Although the code looks kind of scary and inefficient, all versions of GCC that support AVX2 code generation recognize this idiom. You get the expected VMOVDQA xmm?, xmm? instruction, which implicitly clears the upper bits.

Prefer Version C!

Clang:

Version A: Compiled to inefficient code. A temporary YMM register is zeroed using VPXOR, and then that is inserted into the temporary YMM register using VINSERTI128 (or the floating-point forms, depending on version and options).

Version B & C: Also compiled to inefficient code. A temporary YMM register is again zeroed, but here, it is blended with the input YMM register using VPBLENDD.

Nothing ideal!

ICC:

Version A: Ideal. Produces the expected VMOVDQA xmm?, xmm? instruction.

Version B: Compiled to inefficient code. Zeros a temporary YMM register, and then blends zeros with the input YMM register (VPBLENDD).

Version C: Also compiled to inefficient code. Zeros a temporary YMM register, and then uses VINSERTI128 to insert zeros into the temporary YMM register.

Prefer Version A!

MSVC:

Versions A and C: Compiled to inefficient code. Zeros a temporary YMM register, and then uses VINSERTI128 (A) or VINSERTF128 (C) to insert zeros into the temporary YMM register.

Version B: Also compiled to inefficient code. Zeros a temporary YMM register, and then blends this with the input YMM register using VPBLENDD.

Nothing ideal!

In conclusion, then, it is possible to get GCC and ICC to emit the ideal VMOVDQA instruction, if you use the right code sequence. But, I can't see any way to get either Clang or MSVC to safely emit a VMOVDQA instruction. These compilers are missing the optimization opportunity.

So, on Clang and MSVC, we have the choice between XOR+blend and XOR+insert. Which is better? We turn to Agner Fog's instruction tables (spreadsheet version also available):

On AMD's Ryzen architecture: (Bulldozer-family is similar for the AVX __m256 equivalents of these, and for AVX2 on Excavator):

  Instruction   | Ops | Latency | Reciprocal Throughput |   Execution Ports
 ---------------|-----|---------|-----------------------|---------------------
   VMOVDQA      |  1  |    0    |          0.25         |   0 (renamed)
   VPBLENDD     |  2  |    1    |          0.67         |   3
   VINSERTI128  |  2  |    1    |          0.67         |   3

Agner Fog seems to have missed some AVX2 instructions in the Ryzen section of his tables. See this AIDA64 InstLatX64 result for confirmation that VPBLENDD ymm performs the same as VPBLENDW ymm on Ryzen, rather than being the same as VBLENDPS ymm (1c throughput from 2 uops that can run on 2 ports).

See also an Excavator / Carrizo InstLatX64 result showing that VPBLENDD and VINSERTI128 have equal performance there (2-cycle latency, 1 per cycle throughput). Same for VBLENDPS/VINSERTF128.

On Intel architectures (Haswell, Broadwell, and Skylake):

  Instruction   | Ops | Latency | Reciprocal Throughput |   Execution Ports
 ---------------|-----|---------|-----------------------|---------------------
   VMOVDQA      |  1  |   0-1   |          0.33         |   3 (may be renamed)
   VPBLENDD     |  1  |    1    |          0.33         |   3
   VINSERTI128  |  1  |    3    |          1.00         |   1

Obviously, VMOVDQA is optimal on both AMD and Intel, but we already knew that, and it doesn't seem to be an option on either Clang or MSVC until their code generators are improved to recognize one of the above idioms or an additional intrinsic is added for this precise purpose.

Luckily, VPBLENDD is at least as good as VINSERTI128 on both AMD and Intel CPUs. On Intel processors, VPBLENDD is a significant improvement over VINSERTI128. (In fact, it's nearly as good as VMOVDQA in the rare case where the latter cannot be renamed, except for needing an all-zero vector constant.) Prefer the sequence of intrinsics that results in a VPBLENDD instruction if you can't coax your compiler to use VMOVDQA.

If you need a floating-point __m256 or __m256d version of this, the choice is more difficult. On Ryzen, VBLENDPS has 1c throughput, but VINSERTF128 has 0.67c. On all other CPUs (including AMD Bulldozer-family), VBLENDPS is equal or better. It's much better on Intel (same as for integer). If you're optimizing specifically for AMD, you may need to do more tests to see which variant is fastest in your particular sequence of code; otherwise, use the blend. It's only a tiny bit worse on Ryzen.

In summary, then, targeting generic x86 and supporting as many different compilers as possible, we can do:

#if (defined _MSC_VER)

    /* MSVC: no ideal sequence; the blend form is the least bad. */
    ymm = _mm256_blend_epi32(_mm256_setzero_si256(),
                             ymm,
                             _MM_SHUFFLE(0, 0, 3, 3));

#elif (defined __INTEL_COMPILER)

    /* ICC recognizes this idiom and emits a single VMOVDQA. */
    ymm = _mm256_set_m128i(_mm_setzero_si128(), _mm256_castsi256_si128(ymm));

#elif (defined __GNUC__)

    // Intended to cover GCC and Clang. GCC recognizes this idiom and
    // emits VMOVDQA; Clang compiles it to a zero + VPBLENDD.
    ymm = _mm256_inserti128_si256(_mm256_setzero_si256(),
                                  _mm256_castsi256_si128(ymm),
                                  0);

#else
    #error "Unsupported compiler: need to figure out optimal sequence for this compiler."
#endif

See this, and versions A, B, and C separately, on the Godbolt compiler explorer.

Perhaps you could build on this to define your own macro-based intrinsic until something better comes down the pike.
