Writing a portable SSE/AVX version of std::copysign
Problem description
I am currently writing a vectorized version of the QR decomposition (linear system solver) using SSE and AVX intrinsics. One of the substeps requires setting the sign of a value to be equal or opposite to the sign of another value. In the serial version, I used std::copysign for this. Now I want to create a similar function for SSE/AVX registers. Unfortunately, the standard library implements std::copysign with a compiler built-in, so I can't just copy its code and turn it into SSE/AVX instructions.
I have not tried it yet (so I have no code to show for now), but my simple approach would be to create a register with all values set to -0.0, so that only the sign bit is set in each lane. Then I would AND it with the source to find out whether the source's sign bit is set. The result of this operation is either 0.0 or -0.0, depending on the sign of the source. From that result I would build a bitmask (using logic operations) and combine it with the target register (using another logic operation) to set the sign accordingly.
However, I am not sure if there isn't a smarter way to solve this. If there is a built-in function for fundamental data types like floats and doubles, maybe there is also an intrinsic that I missed. Any suggestions?
Thanks in advance.
Thanks to "chtz" for this useful link:
So basically std::copysign compiles to a sequence of 2 AND operations and a subsequent OR. I will reproduce this for SSE/AVX and post the result here in case somebody else needs it some day :)
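To make the "two ANDs and an OR" pattern concrete, here is a minimal scalar sketch. It assumes a 32-bit IEEE 754 float; the names copysign_bits and check_copysign_bits are hypothetical and only serve this illustration, they are not taken from any library or from the disassembly mentioned above.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: scalar copysign written as two ANDs and an OR.
// Returns the magnitude of `value` combined with the sign of `sign`
// (same argument order as std::copysign).
inline float copysign_bits(float value, float sign) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");
    constexpr std::uint32_t sign_mask = 0x80000000u;  // only the sign bit set
    std::uint32_t v, s;
    std::memcpy(&v, &value, sizeof v);  // reinterpret the bits without UB
    std::memcpy(&s, &sign, sizeof s);
    // AND #1 keeps the magnitude bits, AND #2 keeps the sign bit, OR merges them.
    const std::uint32_t r = (v & ~sign_mask) | (s & sign_mask);
    float result;
    std::memcpy(&result, &r, sizeof result);
    return result;
}

// Small self-check against the standard library.
inline void check_copysign_bits() {
    assert(copysign_bits(-5.0f, 1.0f) == std::copysign(-5.0f, 1.0f));
    assert(copysign_bits(3.0f, -1.0f) == std::copysign(3.0f, -1.0f));
    assert(std::signbit(copysign_bits(0.0f, -2.0f)));
}
```

The vectorized versions apply the same three bitwise operations to every lane at once.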
Here is my working version:
__m128 CopySign(__m128 srcSign, __m128 srcValue)
{
    // Extract the sign bit from srcSign
    const __m128 mask0 = _mm_set1_ps(-0.f);
    __m128 tmp0 = _mm_and_ps(srcSign, mask0);
    // Clear the sign bit of srcValue, leaving abs(srcValue)
    __m128 tmp1 = _mm_andnot_ps(mask0, srcValue);
    // Merge the sign bit with the magnitude and return
    return _mm_or_ps(tmp0, tmp1);
}
Tested it with:
__m128 a = _mm_setr_ps(1, -1, -1, 1);
__m128 b = _mm_setr_ps(-5, -11, 3, 4);
__m128 c = CopySign(a, b);
for (U32 i = 0; i < 4; ++i)
    std::cout << simd::GetValue(c, i) << std::endl;
The output is as expected:
5
-11
-3
4
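The test above relies on helpers that are not shown here (U32 and simd::GetValue). A self-contained variant, assuming only the standard SSE intrinsics, could store the lanes to a plain array instead. CopySignToArray and check_copysign_sse are hypothetical names introduced for this sketch.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE

// Same logic as CopySign above, repeated so the snippet compiles on its own.
inline __m128 CopySign(__m128 srcSign, __m128 srcValue) {
    const __m128 signMask = _mm_set1_ps(-0.0f);
    return _mm_or_ps(_mm_and_ps(srcSign, signMask),
                     _mm_andnot_ps(signMask, srcValue));
}

// Hypothetical helper: writes the four lanes of CopySign(a, b) to out[0..3].
inline void CopySignToArray(__m128 a, __m128 b, float out[4]) {
    _mm_storeu_ps(out, CopySign(a, b));  // unaligned store, no alignment needed
}

// Reproduces the test vectors used above.
inline void check_copysign_sse() {
    float out[4];
    CopySignToArray(_mm_setr_ps(1, -1, -1, 1), _mm_setr_ps(-5, -11, 3, 4), out);
    assert(out[0] == 5.0f && out[1] == -11.0f);
    assert(out[2] == -3.0f && out[3] == 4.0f);
}
```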
However, I also tried the version from the disassembly where
__m128 tmp1 = _mm_andnot_ps(mask0, srcValue);
is replaced by:
const __m128 mask1 = _mm_set1_ps(NAN);
__m128 tmp1 = _mm_and_ps(srcValue, mask1);
The results are strange:
4
-8
-3
4
Depending on the chosen numbers, the number is sometimes okay and sometimes not. The sign is always correct. It seems like NaN is not !(-0.0) for some reason. I remember that I had some issues before when I tried to set register values to NaN or specific bit patterns. Maybe somebody has an idea about the origin of the problem?
As 'Maxim Egorushkin' clarified in the comments of his answer, my expectation about NaN being !(-0.0) is wrong. NaN seems not to be a unique bit pattern (see https://steve.hollasch.net/cgindex/coding/ieeefloat.html).
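The strange magnitudes can be reproduced with plain integer arithmetic. The sketch below assumes a 32-bit IEEE 754 float and uses 0x7FC00000, a common quiet-NaN encoding, as a stand-in for whatever the NAN macro expands to (the exact bits of NAN are implementation-defined). The helper names are hypothetical. The point is that a quiet NaN typically has only the exponent bits and the top mantissa bit set, so ANDing with it discards the lower mantissa bits, whereas the complement of -0.0 is 0x7FFFFFFF and keeps all of them.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helpers to move between a float and its raw bits.
inline std::uint32_t bits_of(float x) {
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    return u;
}

inline float float_from(std::uint32_t u) {
    float x;
    std::memcpy(&x, &u, sizeof x);
    return x;
}

// ANDs the bits of x with an arbitrary 32-bit mask.
inline float mask_with(float x, std::uint32_t mask) {
    return float_from(bits_of(x) & mask);
}

constexpr std::uint32_t kQuietNaNBits = 0x7FC00000u;  // assumed NaN encoding
constexpr std::uint32_t kAbsMask      = 0x7FFFFFFFu;  // ~bits_of(-0.0f)

inline void check_nan_mask() {
    // Both patterns are NaNs, so "NaN" does not name a single bit pattern.
    assert(std::isnan(float_from(kQuietNaNBits)));
    assert(std::isnan(float_from(kAbsMask)));
    // The NaN mask drops the low mantissa bits: 5 -> 4, 11 -> 8, 3 stays 3.
    assert(mask_with(-5.0f, kQuietNaNBits) == 4.0f);
    assert(mask_with(-11.0f, kQuietNaNBits) == 8.0f);
    assert(mask_with(3.0f, kQuietNaNBits) == 3.0f);
    // The full complement of the sign bit preserves the magnitude.
    assert(mask_with(-11.0f, kAbsMask) == 11.0f);
}
```

With these assumed bits, 5 becomes 4 and 11 becomes 8 while 3 and 4 survive, which matches the magnitudes in the output above.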
Thank you all very much!
Recommended answer
AVX versions for float and double:
#include <immintrin.h>
__m256 copysign_ps(__m256 from, __m256 to) {
    constexpr float signbit = -0.f;
    auto const avx_signbit = _mm256_broadcast_ss(&signbit);
    return _mm256_or_ps(_mm256_and_ps(avx_signbit, from), _mm256_andnot_ps(avx_signbit, to)); // (avx_signbit & from) | (~avx_signbit & to)
}
__m256d copysign_pd(__m256d from, __m256d to) {
    constexpr double signbit = -0.;
    auto const avx_signbit = _mm256_broadcast_sd(&signbit);
    return _mm256_or_pd(_mm256_and_pd(avx_signbit, from), _mm256_andnot_pd(avx_signbit, to)); // (avx_signbit & from) | (~avx_signbit & to)
}
With AVX2, avx_signbit can be generated without constants:
__m256 copysign2_ps(__m256 from, __m256 to) {
    auto a = _mm256_castps_si256(from);
    auto avx_signbit = _mm256_castsi256_ps(_mm256_slli_epi32(_mm256_cmpeq_epi32(a, a), 31));
    return _mm256_or_ps(_mm256_and_ps(avx_signbit, from), _mm256_andnot_ps(avx_signbit, to)); // (avx_signbit & from) | (~avx_signbit & to)
}
__m256d copysign2_pd(__m256d from, __m256d to) {
    auto a = _mm256_castpd_si256(from);
    auto avx_signbit = _mm256_castsi256_pd(_mm256_slli_epi64(_mm256_cmpeq_epi64(a, a), 63));
    return _mm256_or_pd(_mm256_and_pd(avx_signbit, from), _mm256_andnot_pd(avx_signbit, to)); // (avx_signbit & from) | (~avx_signbit & to)
}
Still, both clang and gcc calculate avx_signbit at compile time and replace it with a constant loaded from the .rodata section, which is, IMO, sub-optimal.
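The same constant-free mask should also be expressible for 128-bit registers with SSE2 alone. The following is a sketch under that assumption, not part of the original answer; the function names copysign2_ps_sse2, copysign2_pd_sse2 and check_copysign2_sse2 are hypothetical. As with the AVX2 version, a compiler may still fold the mask into a constant.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2

inline __m128 copysign2_ps_sse2(__m128 from, __m128 to) {
    const __m128i a = _mm_castps_si128(from);
    // Comparing a with itself sets every bit; shifting each 32-bit lane
    // left by 31 leaves only the sign bit.
    const __m128 signbit =
        _mm_castsi128_ps(_mm_slli_epi32(_mm_cmpeq_epi32(a, a), 31));
    return _mm_or_ps(_mm_and_ps(signbit, from), _mm_andnot_ps(signbit, to));
}

inline __m128d copysign2_pd_sse2(__m128d from, __m128d to) {
    const __m128i a = _mm_castpd_si128(from);
    // SSE2 has no 64-bit integer compare, but a 32-bit compare of a with
    // itself still yields all ones; the 64-bit shift then isolates bit 63.
    const __m128d signbit =
        _mm_castsi128_pd(_mm_slli_epi64(_mm_cmpeq_epi32(a, a), 63));
    return _mm_or_pd(_mm_and_pd(signbit, from), _mm_andnot_pd(signbit, to));
}

// Sanity check; note the argument order (from = sign source, to = value).
inline void check_copysign2_sse2() {
    float f[4];
    _mm_storeu_ps(f, copysign2_ps_sse2(_mm_setr_ps(1, -1, -1, 1),
                                       _mm_setr_ps(-5, -11, 3, 4)));
    assert(f[0] == 5.0f && f[1] == -11.0f && f[2] == -3.0f && f[3] == 4.0f);

    double d[2];
    _mm_storeu_pd(d, copysign2_pd_sse2(_mm_setr_pd(-1.0, 1.0),
                                       _mm_setr_pd(2.5, -7.0)));
    assert(d[0] == -2.5 && d[1] == 7.0);
}
```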