在 C++ 中使用 SSE2 SIMD 对两个数组求和的正确方法 [英] The correct way to sum two arrays with SSE2 SIMD in C++

查看：36 发布时间：2022/1/9 16:37:22 c++ arrays sum sse simd

本文介绍了在 C++ 中使用 SSE2 SIMD 对两个数组求和的正确方法的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

让我们从包括以下内容开始:

Let's start by including the following:

#include <vector>
#include <random>
using namespace std;

现在，假设有以下三个std:vector:


Now, suppose that one has the following three std:vector<float>:
N = 1048576;
vector<float> a(N);
vector<float> b(N);
vector<float> c(N);

default_random_engine randomGenerator(time(0));
uniform_real_distribution<float> diceroll(0.0f, 1.0f);
for(int i-0; i<N; i++)
{
    a[i] = diceroll(randomGenerator);
    b[i] = diceroll(randomGenerator);
}

现在，假设需要按元素对 a 和 b 求和并将结果存储在 c 中，它看起来是标量形式像下面这样:
Now, assume that one needs to sum a and b element-wise and store the result in c, which in scalar form looks like the following:
for(int i=0; i<N; i++)
{
    c[i] = a[i] + b[i];
}

上面代码的 SSE2 矢量化版本是什么，记住输入是上面定义的 a 和 b(即作为 float) 并且输出是 c (也是 float 的集合)?
What would be the SSE2 vectorized version of the above code, keeping in mind that the inputs are a and b as defined above (i.e. as a collection of float) and ehe output is c (also a collection of float)?
经过大量研究，我得出以下结论:
After quite a bit of research, I was able to come up with the following:
for(int i=0; i<N; i+=4)
{
    float a_toload[4] = { a[i], a[i + 1], a[i + 2], a[i + 3] };
    float b_toload[4] = { b[i], b[i + 1], b[i + 2], b[i + 3] };
    __m128 loaded_a = _mm_loadu_ps(a_toload);
    __m128 loaded_b = _mm_loadu_ps(b_toload);

    float result[4] = { 0, 0, 0, 0 };
    _mm_storeu_ps(result, _mm_add_ps(loaded_a , loaded_b));
    c[i] = result[0];
    c[i + 1] = result[1];
    c[i + 2] = result[2];
    c[i + 3] = result[3];
}

但是，这似乎真的很麻烦，而且效率肯定很低:上面的 SIMD 版本实际上比初始标量版本慢了三倍(当然，在 Microsoft VS15 的发布模式下和之后进行了优化测量)100 万次迭代，而不仅仅是 12 次).
However, this seems to be really cumbersome and is certainly quite inefficient: the SIMD version above is actually three times slower than the initial scalar version (measured, of course, with optimizations on, in release mode of Microsoft VS15, and after 1 million iterations, not just 12).
推荐答案
你的for循环可以简化为
Your for-loop could be simplified to
const int aligendN = N - N % 4;
for (int i = 0; i < alignedN; i+=4) {
    _mm_storeu_ps(&c[i], 
                  _mm_add_ps(_mm_loadu_ps(&a[i]), 
                  _mm_loadu_ps(&b[i])));
}
for (int i = alignedN; i < N; ++i) {
    c[i] = a[i] + b[i];
}

一些补充说明:
处理最后几个浮点数的小循环很常见，当 N%4 != 0 或 N 在编译时未知时，它是强制性的.
我注意到您选择了未对齐的版本加载/存储，与对齐的版本相比，惩罚很小.我在 stackoverflow 找到了这个链接:在 x64_64 Intel CPU 上，SSE 非对齐负载内在是否比对齐负载内在慢?


A small loop handling the last several floats is quite common and when N%4 != 0 or N is unknown at compile time it is mandatory.
I notice that you choose unaligned version load/store, there is small penalty compared to aligned version. I found this link at stackoverflow:  Is the SSE unaligned load intrinsic any slower than the aligned load intrinsic on x64_64 Intel CPUs?


                        这篇关于在 C++ 中使用 SSE2 SIMD 对两个数组求和的正确方法的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！


                    
                        查看全文

在 C++ 中使用 SSE2 SIMD 对两个数组求和的正确方法 [英] The correct way to sum two arrays with SSE2 SIMD in C++

问题描述

推荐答案

相关文章

C/C++开发最新文章

热门教程

热门工具

登录关闭

在 C++ 中使用 SSE2 SIMD 对两个数组求和的正确方法 [英] The correct way to sum two arrays with SSE2 SIMD in C++

问题描述

推荐答案

相关文章

C/C++开发最新文章

热门教程

热门工具

登录 关闭

登录关闭