SSE/AVX: Choose from two __m256 float vectors based on per-element min and max absolute value
Question
I am looking for an SSE/AVX way to compute the following:
// Given
float u[8];
float v[8];
// Compute
float a[8];
float b[8];
// Such that
for ( int i = 0; i < 8; ++i )
{
a[i] = fabs(u[i]) >= fabs(v[i]) ? u[i] : v[i];
b[i] = fabs(u[i]) < fabs(v[i]) ? u[i] : v[i];
}
I.e., I need to select element-wise into a from u and v based on mask, and into b based on !mask, where mask = (fabs(u) >= fabs(v)) element-wise.
Recommended answer
I had this exact same problem just the other day. The solution I came up with (using AVX only) was:
// take the absolute value of u and v
__m256 sign_bit = _mm256_set1_ps(-0.0f);
__m256 u_abs = _mm256_andnot_ps(sign_bit, u);
__m256 v_abs = _mm256_andnot_ps(sign_bit, v);
// get a mask indicating the indices for which abs(u[i]) >= abs(v[i])
__m256 u_ge_v = _mm256_cmp_ps(u_abs, v_abs, _CMP_GE_OS);
// use the mask to select the appropriate elements into a and b; blendv returns its
// second operand where the mask is set, so u goes second for a and first for b
__m256 a = _mm256_blendv_ps(v, u, u_ge_v);
__m256 b = _mm256_blendv_ps(u, v, u_ge_v);
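To make the operand order easy to check, here is a minimal sketch that wraps the same steps in a function operating on plain float arrays. The function name select_by_abs_avx, the unaligned loads and stores, and the GCC/Clang target attribute are assumptions for illustration and are not part of the original answer. Recall that _mm256_blendv_ps(x, y, m) returns y in lanes where the sign bit of m is set and x elsewhere.

```c
#include <immintrin.h>

/* Hypothetical helper (illustrative name): for 8 floats,
 *   a[i] = |u[i]| >= |v[i]| ? u[i] : v[i]
 *   b[i] = |u[i]| >= |v[i]| ? v[i] : u[i]
 * The target attribute is a GCC/Clang extension so the file builds
 * without -mavx; drop it if you compile with -mavx instead. */
__attribute__((target("avx")))
void select_by_abs_avx(const float *u, const float *v, float *a, float *b)
{
    const __m256 sign_bit = _mm256_set1_ps(-0.0f);   /* only the sign bit set */
    const __m256 uu = _mm256_loadu_ps(u);
    const __m256 vv = _mm256_loadu_ps(v);

    /* clear the sign bit to get absolute values */
    const __m256 u_abs = _mm256_andnot_ps(sign_bit, uu);
    const __m256 v_abs = _mm256_andnot_ps(sign_bit, vv);

    /* all-ones in lanes where |u| >= |v| */
    const __m256 u_ge_v = _mm256_cmp_ps(u_abs, v_abs, _CMP_GE_OS);

    /* blendv picks its second operand where the mask is set */
    _mm256_storeu_ps(a, _mm256_blendv_ps(vv, uu, u_ge_v));
    _mm256_storeu_ps(b, _mm256_blendv_ps(uu, vv, u_ge_v));
}
```

Calling it with u = {1, -2, 3, -4, ...} and v = {-3, 1, -3, 5, ...} should give a = {-3, -2, 3, 5, ...} and b = {1, 1, -3, -4, ...}. One difference from the scalar loop: if either input is NaN the mask is false, so this sketch puts v into a and u into b, whereas the scalar code puts v into both.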
The AVX512 equivalent would be:
// take the absolute value of u and v (note: _mm512_andnot_ps requires AVX512DQ)
__m512 sign_bit = _mm512_set1_ps(-0.0f);
__m512 u_abs = _mm512_andnot_ps(sign_bit, u);
__m512 v_abs = _mm512_andnot_ps(sign_bit, v);
// get a mask indicating the indices for which abs(u[i]) >= abs(v[i])
__mmask16 u_ge_v = _mm512_cmp_ps_mask(u_abs, v_abs, _CMP_GE_OS);
// use the mask to select the appropriate elements into a and b; mask_blend returns its
// last operand where the mask bit is set, so u goes last for a and first for b
__m512 a = _mm512_mask_blend_ps(u_ge_v, v, u);
__m512 b = _mm512_mask_blend_ps(u_ge_v, u, v);
As Peter Cordes suggested in the comments above, there are other approaches as well, such as taking the absolute value followed by a min/max and then reinserting the sign bit, but I couldn't find anything shorter or with lower latency than this sequence of instructions.
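Since the title also mentions SSE, here is a rough sketch of the same compare-and-select idea for 128-bit vectors using only baseline SSE instructions. It is an assumption-laden port, not something from the original answer: the name select_by_abs_sse is hypothetical, and because _mm_blendv_ps only arrives with SSE4.1, the blend is emulated with and/andnot/or.

```c
#include <xmmintrin.h>

/* Hypothetical helper (illustrative name): 4-wide SSE port of the idea.
 *   a[i] = |u[i]| >= |v[i]| ? u[i] : v[i]
 *   b[i] = |u[i]| >= |v[i]| ? v[i] : u[i] */
void select_by_abs_sse(const float *u, const float *v, float *a, float *b)
{
    const __m128 sign_bit = _mm_set1_ps(-0.0f);      /* only the sign bit set */
    const __m128 uu = _mm_loadu_ps(u);
    const __m128 vv = _mm_loadu_ps(v);

    /* clear the sign bit to get absolute values */
    const __m128 u_abs = _mm_andnot_ps(sign_bit, uu);
    const __m128 v_abs = _mm_andnot_ps(sign_bit, vv);

    /* all-ones in lanes where |u| >= |v|, all-zeros elsewhere */
    const __m128 m = _mm_cmpge_ps(u_abs, v_abs);

    /* emulate blend: (m & x) | (~m & y) takes x where m is set, y elsewhere */
    const __m128 ra = _mm_or_ps(_mm_and_ps(m, uu), _mm_andnot_ps(m, vv));
    const __m128 rb = _mm_or_ps(_mm_and_ps(m, vv), _mm_andnot_ps(m, uu));

    _mm_storeu_ps(a, ra);
    _mm_storeu_ps(b, rb);
}
```

On SSE4.1 and later, the three-instruction blend could be replaced by _mm_blendv_ps with the same operand order as the AVX version.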