The correct way to sum two arrays with SSE2 SIMD in C++
Question
First, include the following:
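#include <ctime>        // time(), used to seed the random engine
#include <random>
#include <vector>
#include <emmintrin.h>  // SSE2 intrinsics (_mm_loadu_ps, _mm_add_ps, _mm_storeu_ps)
using namespace std;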
Now, suppose that one has the following three std::vector<float>:
const int N = 1048576;
vector<float> a(N);
vector<float> b(N);
vector<float> c(N);
default_random_engine randomGenerator(time(0));
uniform_real_distribution<float> diceroll(0.0f, 1.0f);
for(int i=0; i<N; i++)
{
a[i] = diceroll(randomGenerator);
b[i] = diceroll(randomGenerator);
}
Now, assume that one needs to sum a and b element-wise and store the result in c, which in scalar form looks like the following:
for(int i=0; i<N; i++)
{
c[i] = a[i] + b[i];
}
What would be the SSE2 vectorized version of the above code, keeping in mind that the inputs are a and b as defined above (i.e. as collections of float) and the output is c (also a collection of float)?
After quite a bit of research, I was able to come up with the following:
for(int i=0; i<N; i+=4)
{
float a_toload[4] = { a[i], a[i + 1], a[i + 2], a[i + 3] };
float b_toload[4] = { b[i], b[i + 1], b[i + 2], b[i + 3] };
__m128 loaded_a = _mm_loadu_ps(a_toload);
__m128 loaded_b = _mm_loadu_ps(b_toload);
float result[4] = { 0, 0, 0, 0 };
_mm_storeu_ps(result, _mm_add_ps(loaded_a , loaded_b));
c[i] = result[0];
c[i + 1] = result[1];
c[i + 2] = result[2];
c[i + 3] = result[3];
}
However, this seems really cumbersome and is certainly quite inefficient: the SIMD version above is actually three times slower than the initial scalar version (measured, of course, with optimizations on, in release mode of Microsoft VS15, and after 1 million iterations, not just 12).
Recommended answer
Your for loop could be simplified to
const int alignedN = N - N % 4;
for (int i = 0; i < alignedN; i+=4) {
_mm_storeu_ps(&c[i],
_mm_add_ps(_mm_loadu_ps(&a[i]),
_mm_loadu_ps(&b[i])));
}
for (int i = alignedN; i < N; ++i) {
c[i] = a[i] + b[i];
}
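For reference, the same loop can be wrapped into a self-contained function. This is only a sketch: the name add_arrays_sse and the choice to return a new vector are assumptions made for illustration and do not come from the original answer. It loads directly from the vectors' storage instead of copying through temporary arrays, and it handles the tail with a scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#include <emmintrin.h>  // SSE2 header; also pulls in the SSE float intrinsics

// Hypothetical helper (name is an assumption): returns c with c[i] = a[i] + b[i].
std::vector<float> add_arrays_sse(const std::vector<float>& a,
                                  const std::vector<float>& b)
{
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    std::vector<float> c(n);

    // Largest multiple of 4 not exceeding n: the part handled 4 floats at a time.
    const std::size_t vec_end = n - n % 4;
    for (std::size_t i = 0; i < vec_end; i += 4) {
        // Unaligned loads and stores work for any address.
        _mm_storeu_ps(&c[i],
                      _mm_add_ps(_mm_loadu_ps(&a[i]),
                                 _mm_loadu_ps(&b[i])));
    }
    // Scalar tail for the remaining 0 to 3 elements.
    for (std::size_t i = vec_end; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
    return c;
}
```

With the vectors from the question, the call would be c = add_arrays_sse(a, b);.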
Some additional explanation:
1. A small loop handling the last several floats is quite common, and it is mandatory when N % 4 != 0 or when N is unknown at compile time.
2. I notice that you chose the unaligned versions of load/store; there is a small penalty compared to the aligned versions. I found this link on Stack Overflow: Is the SSE unaligned load intrinsic any slower than the aligned load intrinsic on x64_64 Intel CPUs?
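To illustrate the aligned variant mentioned in the second note, here is a rough sketch. The names is_aligned16 and add_arrays_sse_aligned are hypothetical and not part of the original answer. _mm_load_ps and _mm_store_ps require 16-byte-aligned addresses, and the storage of a std::vector<float> is not guaranteed to meet that in general, so the sketch checks the pointers at run time and falls back to the unaligned intrinsics otherwise.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 header; also pulls in the SSE float intrinsics

// Hypothetical helper: true if p is on a 16-byte boundary.
static bool is_aligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: c[i] = a[i] + b[i] for i in [0, n).
void add_arrays_sse_aligned(const float* a, const float* b, float* c, std::size_t n)
{
    assert(n == 0 || (a && b && c));
    const std::size_t vec_end = n - n % 4;

    if (is_aligned16(a) && is_aligned16(b) && is_aligned16(c)) {
        // Each step advances 4 floats (16 bytes), so alignment is preserved.
        for (std::size_t i = 0; i < vec_end; i += 4) {
            _mm_store_ps(c + i, _mm_add_ps(_mm_load_ps(a + i), _mm_load_ps(b + i)));
        }
    } else {
        // At least one pointer is not 16-byte aligned: use the unaligned forms.
        for (std::size_t i = 0; i < vec_end; i += 4) {
            _mm_storeu_ps(c + i, _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
        }
    }
    // Scalar tail for the remaining 0 to 3 elements.
    for (std::size_t i = vec_end; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

With the vectors from the question this would be called as add_arrays_sse_aligned(a.data(), b.data(), c.data(), a.size());. Whether the aligned path is measurably faster depends on the CPU, so it is worth benchmarking before adopting it.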