Fastest way to set __m256 value to all ONE bits

Question

How can I set a value of 1 to all bits in an __m256 value? Using either AVX or AVX2 intrinsics?

To get all zeros, you can use _mm256_setzero_si256().

To get all ones, I'm currently using _mm256_set1_epi64x(-1), but I suspect that this is slower than the all-zero case. Is there memory access or scalar/SSE/AVX switching involved here?

And I can't seem to find a simple bitwise NOT operation in AVX? If that was available, I could simply use the setzero, followed by a vector NOT.

Recommended answer

See also "Set all bits in CPU register to 1 efficiently", which covers AVX, AVX2, and AVX512 zmm and k (mask) registers.

You obviously didn't even look at the asm output, which is trivial to do:

#include <immintrin.h>
__m256i all_ones(void) { return _mm256_set1_epi64x(-1); }

With GCC and clang, and any -march that includes AVX2, this compiles to:

    vpcmpeqd        ymm0, ymm0, ymm0
    ret

To get a __m256 (not __m256i) you can just cast the result:

  __m256 nans = _mm256_castsi256_ps( _mm256_set1_epi32(-1) );
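As a minimal sketch of the cast above, the following wraps it in a helper and checks the result bit by bit. The function names are hypothetical, and the `target("avx2")` attribute is a GCC/clang extension used here only so the file builds without `-mavx2`; it still needs an AVX2-capable CPU to run.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: a __m256 with all 256 bits set.
   target("avx2") is a GCC/clang attribute so this builds without -mavx2. */
__attribute__((target("avx2")))
static __m256 all_ones_ps(void) {
    return _mm256_castsi256_ps(_mm256_set1_epi32(-1));
}

/* Returns 1 if every bit of all_ones_ps() is set, else 0.
   The check is done on raw bits because each lane is a NaN,
   and a NaN never compares equal to anything as a float. */
__attribute__((target("avx2")))
int all_ones_ps_is_all_set(void) {
    float lanes[8];
    uint32_t bits[8];
    _mm256_storeu_ps(lanes, all_ones_ps());
    memcpy(bits, lanes, sizeof bits);
    for (int i = 0; i < 8; i++) {
        if (bits[i] != 0xFFFFFFFFu) return 0;
    }
    return 1;
}
```

The cast itself is free: `_mm256_castsi256_ps` only reinterprets the register and generates no instruction.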

Without AVX2, a possible option is vcmptrueps dst, ymm0,ymm0 preferably with a cold register for the input to mitigate the false dependency.

Recent clang (5.0 and later) xor-zeroes a vector and then uses vcmpps with a TRUE predicate if AVX2 isn't available. Older clang makes a 128-bit all-ones with vpcmpeqd xmm and uses vinsertf128. GCC loads from memory, even modern GCC 10.1 with -march=sandybridge.
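The AVX1-only approach can be written with intrinsics roughly as follows. This is a sketch, not a guarantee of the generated code: the helper names are hypothetical, `target("avx")` is a GCC/clang attribute, and a compiler is free to turn this into a load from memory instead of vxorps + vcmptrueps.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: all-ones without AVX2.
   A freshly zeroed register is used as the compare input so the
   compare does not wait on whatever last wrote some other register. */
__attribute__((target("avx")))
static __m256 all_ones_avx1(void) {
    __m256 z = _mm256_setzero_ps();
    /* _CMP_TRUE_UQ: the predicate is always true, so every lane is all-ones. */
    return _mm256_cmp_ps(z, z, _CMP_TRUE_UQ);
}

/* Returns 1 if every bit of the result is set, else 0 (raw-bit check). */
__attribute__((target("avx")))
int all_ones_avx1_is_all_set(void) {
    float lanes[8];
    uint32_t bits[8];
    _mm256_storeu_ps(lanes, all_ones_avx1());
    memcpy(bits, lanes, sizeof bits);
    for (int i = 0; i < 8; i++) {
        if (bits[i] != 0xFFFFFFFFu) return 0;
    }
    return 1;
}
```

Checking the asm output for your own compiler and -march is still the only way to know which strategy it actually picked.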

As described by the vector section of Agner Fog's optimizing assembly guide, generating constants on the fly this way is cheap. It still takes a vector execution unit to generate the all-ones (unlike _mm_setzero), but it's better than any possible two-instruction sequence, and usually better than a load. See also the x86 tag wiki.

Compilers don't like to generate more complex constants on the fly, even ones that could be generated from all-ones with a simple shift. Even if you try, by writing __m128i float_signbit_mask = _mm_srli_epi32(_mm_set1_epi16(-1), 1), compilers typically do constant-propagation and put the vector in memory. (Despite the name, that right shift yields 0x7FFFFFFF per element, i.e. every bit except the sign bit; a left shift by 31 would give the sign bit itself.) This lets them fold it into a memory operand when used later in cases where there's no loop to hoist the constant out of.
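For illustration, here are two float masks derived from all-ones with a single shift each. The function names are hypothetical, and the sketch assumes x86-64, where SSE2 is baseline and needs no extra compiler flags. As noted above, compilers will usually constant-propagate these and emit a load or an immediate rather than the shift.

```c
#include <immintrin.h>
#include <stdint.h>

/* Lane 0 of srli(all-ones, 1): 0x7FFFFFFF, every bit except the sign bit.
   ANDing a float vector with this mask computes fabs(). */
uint32_t abs_mask_lane0(void) {
    __m128i ones = _mm_set1_epi16(-1);
    __m128i mask = _mm_srli_epi32(ones, 1);
    return (uint32_t)_mm_cvtsi128_si32(mask);
}

/* Lane 0 of slli(all-ones, 31): 0x80000000, only the sign bit.
   XORing a float vector with this mask negates it. */
uint32_t sign_mask_lane0(void) {
    __m128i ones = _mm_set1_epi16(-1);
    __m128i mask = _mm_slli_epi32(ones, 31);
    return (uint32_t)_mm_cvtsi128_si32(mask);
}
```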

And I can't seem to find a simple bitwise NOT operation in AVX?

You do that by XORing with all-ones with vxorps (_mm256_xor_ps). Unfortunately SSE/AVX don't provide a way to do a NOT without a vector constant.
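A bitwise NOT built this way might look like the sketch below. `not_ps` and `not_lane0` are hypothetical names, and `target("avx")` is a GCC/clang attribute so the file builds without -mavx.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: bitwise NOT of a __m256, as XOR with all-ones. */
__attribute__((target("avx")))
static __m256 not_ps(__m256 v) {
    const __m256 ones = _mm256_castsi256_ps(_mm256_set1_epi32(-1));
    return _mm256_xor_ps(v, ones);
}

/* Broadcasts x to all lanes, applies not_ps, and returns lane 0's raw bits. */
__attribute__((target("avx")))
uint32_t not_lane0(uint32_t x) {
    __m256 v = _mm256_castsi256_ps(_mm256_set1_epi32((int)x));
    float lanes[8];
    uint32_t bits;
    _mm256_storeu_ps(lanes, not_ps(v));
    memcpy(&bits, lanes, sizeof bits);
    return bits;
}
```

Inside a loop the all-ones constant is normally hoisted into a register once, so the NOT costs a single vxorps per iteration.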

FP vs. integer instructions and bypass latency

Intel CPUs (at least Skylake) have a weird effect where the extra bypass latency between SIMD-integer and SIMD-FP still happens long after the uop producing the register has executed. e.g. vmulps ymm1, ymm2, ymm0 could have an extra cycle of latency for the ymm2 -> ymm1 critical path if ymm0 was produced by vpcmpeqd. And this lasts until the next context switch restores FP state if you don't otherwise overwrite ymm0.

This is not a problem for bitwise instructions like vxorps (even though the mnemonic has ps, it doesn't have bypass delay from FP or vec-int domains on Skylake, IIRC).

So normally it's safe to create a set1(-1) constant with an integer instruction because that's a NaN and you wouldn't normally use it with FP math instructions like mul or add.
