SSE reduction of float vector

Views: 93 | Published: 2020/9/26 23:00:15 | Tags: c++, sum, sse, simd, reduction


Question

How can I compute the sum of the elements (a reduction) of a float vector using SSE intrinsics?

Simple serial code:

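// The original snippet omitted the function name; "sum" is a placeholder.
void sum(const float *input, float &result, unsigned int NumElems)
{
     result = 0;
     // Use an unsigned index so the comparison with NumElems is not mixed-sign.
     for (unsigned int i = 0; i < NumElems; ++i)
         result += input[i];
}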

Recommended answer

Typically you accumulate 4 partial sums (one per vector lane) in the loop, and then sum horizontally across those 4 elements after the loop, e.g.

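#include <cassert>
#include <cstdint>
#include <pmmintrin.h>  // SSE3 header: _mm_hadd_ps is not declared in <emmintrin.h>

float vsum(const float *a, int n)
{
    float sum;
    __m128 vsum = _mm_set1_ps(0.0f);        // 4 partial sums, all zero
    assert((n & 3) == 0);                   // n must be a multiple of 4
    assert(((uintptr_t)a & 15) == 0);       // a must be 16-byte aligned
    for (int i = 0; i < n; i += 4)
    {
        __m128 v = _mm_load_ps(&a[i]);      // aligned load of 4 floats
        vsum = _mm_add_ps(vsum, v);         // accumulate lane by lane
    }
    vsum = _mm_hadd_ps(vsum, vsum);         // pairwise sums of adjacent lanes
    vsum = _mm_hadd_ps(vsum, vsum);         // every lane now holds the total
    _mm_store_ss(&sum, vsum);               // extract lane 0
    return sum;
}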

Note: in the example above, a must be 16-byte aligned and n must be a multiple of 4. If the alignment of a cannot be guaranteed, use _mm_loadu_ps instead of _mm_load_ps. If n is not guaranteed to be a multiple of 4, add a scalar loop at the end of the function to accumulate the remaining elements. Also note that _mm_hadd_ps is an SSE3 intrinsic, so the code must be built with SSE3 enabled (for example -msse3 with GCC or Clang).
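As a rough sketch of the two adjustments described in the note, the variant below uses unaligned loads and a scalar tail loop, so it accepts any pointer alignment and any length. The function name vsum_any is hypothetical and not part of the original answer. It also replaces the two _mm_hadd_ps calls with an SSE-only shuffle sequence for the horizontal sum, which is a substitution made here so that the sketch needs only <xmmintrin.h>; the original answer uses _mm_hadd_ps.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE only: no SSE3 needed for this variant

// Hypothetical helper: sums n floats with no alignment or length requirement.
float vsum_any(const float *a, std::size_t n)
{
    __m128 acc = _mm_setzero_ps();          // 4 partial sums, one per lane
    std::size_t i = 0;

    // Vector part: 4 floats per iteration, using unaligned loads.
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));

    // Horizontal sum with shuffles (assumption: used here instead of _mm_hadd_ps).
    __m128 hi = _mm_movehl_ps(acc, acc);    // [a2, a3, a2, a3]
    __m128 s  = _mm_add_ps(acc, hi);        // lane0 = a0+a2, lane1 = a1+a3
    __m128 s1 = _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 1, 1, 1)); // broadcast lane1
    s = _mm_add_ss(s, s1);                  // lane0 = a0+a1+a2+a3
    float sum = _mm_cvtss_f32(s);

    // Scalar tail: at most 3 leftover elements.
    for (; i < n; ++i)
        sum += a[i];

    return sum;
}
```

Because the partial sums are added in a different order than in the serial loop, the result can differ from the scalar version in the last few bits for general floating-point input.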
