Horizontal sum of 8 packed 32-bit floats
Question
If I have 8 packed 32-bit floating point numbers (__m256), what's the fastest way to extract the horizontal sum of all 8 elements? Similarly, how do I obtain the horizontal maximum and minimum? In other words, what's the best implementation of the following C++ functions?
float sum(__m256 x); ///< returns sum of all 8 elements
float max(__m256 x); ///< returns the maximum of all 8 elements
float min(__m256 x); ///< returns the minimum of all 8 elements
Recommended answer
Quickly jotted down here (and hence untested):
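#include <immintrin.h>

float sum(__m256 x) {
    __m128 hi = _mm256_extractf128_ps(x, 1);  // upper four floats: x4..x7
    __m128 lo = _mm256_extractf128_ps(x, 0);  // lower four floats: x0..x3
    lo = _mm_add_ps(hi, lo);                  // four sums: x0+x4, x1+x5, x2+x6, x3+x7
    hi = _mm_movehl_ps(hi, lo);               // move elements 2 and 3 of lo into positions 0 and 1
    lo = _mm_add_ps(hi, lo);                  // elements 0 and 1 now hold the two remaining partial sums
    hi = _mm_shuffle_ps(lo, lo, 1);           // bring element 1 down to position 0
    lo = _mm_add_ss(hi, lo);                  // element 0 = sum of all eight floats
    return _mm_cvtss_f32(lo);                 // extract element 0 as a scalar
}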
For min/max, replace _mm_add_ps and _mm_add_ss with _mm_max_* or _mm_min_*.
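Spelled out, that substitution might look like the sketch below. The names hmax, hmin, hmax8, hmin8 and the AVX_TARGET macro are hypothetical helpers introduced for illustration, not part of the original answer. The macro assumes GCC or Clang, where it lets the snippet build without -mavx; on other compilers it expands to nothing.

```cpp
#include <immintrin.h>

// Assumption: GCC/Clang. Enables AVX for these functions only.
#if defined(__GNUC__) || defined(__clang__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

// Same reduction tree as sum(), with max in place of add.
AVX_TARGET static inline float hmax(__m256 x) {
    __m128 hi = _mm256_extractf128_ps(x, 1);  // x4..x7
    __m128 lo = _mm256_extractf128_ps(x, 0);  // x0..x3
    lo = _mm_max_ps(hi, lo);                  // 8 candidates -> 4
    hi = _mm_movehl_ps(hi, lo);               // elements 2,3 -> positions 0,1
    lo = _mm_max_ps(hi, lo);                  // 4 candidates -> 2
    hi = _mm_shuffle_ps(lo, lo, 1);           // element 1 -> position 0
    lo = _mm_max_ss(hi, lo);                  // 2 candidates -> 1
    return _mm_cvtss_f32(lo);
}

// Same reduction tree, with min in place of add.
AVX_TARGET static inline float hmin(__m256 x) {
    __m128 hi = _mm256_extractf128_ps(x, 1);
    __m128 lo = _mm256_extractf128_ps(x, 0);
    lo = _mm_min_ps(hi, lo);
    hi = _mm_movehl_ps(hi, lo);
    lo = _mm_min_ps(hi, lo);
    hi = _mm_shuffle_ps(lo, lo, 1);
    lo = _mm_min_ss(hi, lo);
    return _mm_cvtss_f32(lo);
}

// Array-based wrappers so callers need not be compiled with AVX enabled.
// p must point to at least 8 floats; no alignment is required.
AVX_TARGET float hmax8(const float* p) { return hmax(_mm256_loadu_ps(p)); }
AVX_TARGET float hmin8(const float* p) { return hmin(_mm256_loadu_ps(p)); }
```

For an array such as {3, -1, 7.5, 0, 2, 9.25, -4, 1}, hmax8 should return 9.25 and hmin8 should return -4. The max/min instructions are not symmetric when an operand is NaN, so results for inputs containing NaN depend on element order.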
Note that this is a lot of work for a few operations; AVX isn't really intended to do horizontal operations efficiently. If you can batch up this work for multiple vectors, then more efficient solutions are possible.
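One possible way to batch the work, sketched here as an assumption about what such a solution could look like rather than something the answer spells out: reduce eight vectors at once, so the result is itself a full __m256 holding the eight horizontal sums. The names hsum8x8, row_sums8 and AVX_TARGET are hypothetical, and the macro assumes GCC or Clang.

```cpp
#include <immintrin.h>

// Assumption: GCC/Clang. Enables AVX for these functions only.
#if defined(__GNUC__) || defined(__clang__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

// Returns a vector whose element i is the horizontal sum of v[i].
AVX_TARGET static inline __m256 hsum8x8(const __m256 v[8]) {
    // hadd works within each 128-bit lane: adjacent pairs of a, then of b.
    __m256 s01 = _mm256_hadd_ps(v[0], v[1]);
    __m256 s23 = _mm256_hadd_ps(v[2], v[3]);
    __m256 s45 = _mm256_hadd_ps(v[4], v[5]);
    __m256 s67 = _mm256_hadd_ps(v[6], v[7]);
    // After a second round, each lane holds per-vector half sums:
    // low lane = sums of elements 0..3, high lane = sums of elements 4..7.
    __m256 t0123 = _mm256_hadd_ps(s01, s23);
    __m256 t4567 = _mm256_hadd_ps(s45, s67);
    // Gather the low-half sums of v0..v7 and the high-half sums of v0..v7.
    __m256 lo = _mm256_permute2f128_ps(t0123, t4567, 0x20);
    __m256 hi = _mm256_permute2f128_ps(t0123, t4567, 0x31);
    return _mm256_add_ps(lo, hi);
}

// Array-based wrapper: m is an 8x8 row-major matrix (64 floats),
// out receives the 8 row sums. No alignment is required.
AVX_TARGET void row_sums8(const float* m, float* out) {
    __m256 rows[8];
    for (int r = 0; r < 8; ++r) {
        rows[r] = _mm256_loadu_ps(m + 8 * r);
    }
    _mm256_storeu_ps(out, hsum8x8(rows));
}
```

This uses seven add-type instructions and two lane permutes for eight sums, and the results stay in a vector register for further SIMD work instead of being extracted one scalar at a time. Whether it is actually faster depends on the target microarchitecture, so it should be measured.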