SSE reduction of float vector
Question
How can I get the sum of the elements (a reduction) of a float vector using SSE intrinsics?
Simple serial code:
void sum_serial(const float *input, float &result, unsigned int NumElems)
{
    result = 0;
    for (unsigned int i = 0; i < NumElems; ++i)
        result += input[i];
}
Recommended answer
Typically you accumulate 4 partial sums in your loop and then sum horizontally across the 4 elements after the loop, e.g.
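#include <cassert>
#include <cstdint>
#include <pmmintrin.h> // SSE3, required for _mm_hadd_ps

float vsum(const float *a, int n)
{
    float sum;
    __m128 vsum = _mm_set1_ps(0.0f);   // four partial sums, all zero

    assert((n & 3) == 0);              // n must be a multiple of 4
    assert(((uintptr_t)a & 15) == 0);  // a must be 16-byte aligned

    for (int i = 0; i < n; i += 4)
    {
        __m128 v = _mm_load_ps(&a[i]); // aligned load of 4 floats
        vsum = _mm_add_ps(vsum, v);    // accumulate into the partial sums
    }

    // Two horizontal adds leave the total in every lane.
    vsum = _mm_hadd_ps(vsum, vsum);
    vsum = _mm_hadd_ps(vsum, vsum);
    _mm_store_ss(&sum, vsum);          // extract the lowest lane
    return sum;
}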
Note: in the above example, a must be 16-byte aligned and n must be a multiple of 4. If the alignment of a cannot be guaranteed, use _mm_loadu_ps instead of _mm_load_ps. If n is not guaranteed to be a multiple of 4, add a scalar loop at the end of the function to accumulate any remaining elements.
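The two adjustments described in the note could be combined roughly as follows. This is only a sketch, not code from the original answer: the name vsum_any and the std::size_t length parameter are made up for illustration, and the horizontal sum uses SSE1 shuffles rather than _mm_hadd_ps so that the function does not depend on SSE3.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h> // SSE: _mm_loadu_ps, _mm_add_ps, _mm_shuffle_ps, ...

// Hypothetical helper (not from the original answer): sums n floats with no
// alignment requirement on a and no restriction on n.
float vsum_any(const float *a, std::size_t n)
{
    assert(a != nullptr || n == 0);

    __m128 acc = _mm_setzero_ps(); // four running partial sums
    std::size_t i = 0;

    // Vector part: four floats per iteration, using unaligned loads.
    for (; i + 4 <= n; i += 4)
    {
        acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));
    }

    // Horizontal sum with SSE1 shuffles (swapped in for _mm_hadd_ps).
    __m128 hi = _mm_movehl_ps(acc, acc); // lanes 0,1 = acc lanes 2,3
    __m128 pair = _mm_add_ps(acc, hi);   // lane 0 = a0+a2, lane 1 = a1+a3
    __m128 odd = _mm_shuffle_ps(pair, pair, _MM_SHUFFLE(1, 1, 1, 1)); // broadcast lane 1
    float sum = _mm_cvtss_f32(_mm_add_ss(pair, odd)); // (a0+a2) + (a1+a3)

    // Scalar tail: accumulate the 0 to 3 leftover elements.
    for (; i < n; ++i)
    {
        sum += a[i];
    }
    return sum;
}
```

Because the additions happen in a different order than in the serial loop, the result can differ from it by a small rounding error for general inputs.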