reduction with OpenMP with SSE/AVX

Question
I want to do a reduction on an array using OpenMP and SIMD. I read that a reduction in OpenMP is equivalent to:
inline float sum_scalar_openmp2(const float a[], const size_t N) {
    float sum = 0.0f;
    #pragma omp parallel
    {
        float sum_private = 0.0f;
        #pragma omp for nowait
        for(int i=0; i<N; i++) {
            sum_private += a[i];
        }
        #pragma omp atomic
        sum += sum_private;
    }
    return sum;
}
I got this idea from the following link: http://bisqwit.iki.fi/story/howto/openmp/#ReductionClause

But atomic also does not support complex operators. What I did was replace atomic with critical and implement the reduction with OpenMP and SSE like this:
#define ROUND_DOWN(x, s) ((x) & ~((s)-1))

inline float sum_vector4_openmp(const float a[], const size_t N) {
    __m128 sum4 = _mm_set1_ps(0.0f);
    #pragma omp parallel
    {
        __m128 sum4_private = _mm_set1_ps(0.0f);
        #pragma omp for nowait
        for(int i=0; i < ROUND_DOWN(N, 4); i+=4) {
            __m128 a4 = _mm_load_ps(a + i);
            sum4_private = _mm_add_ps(a4, sum4_private);
        }
        #pragma omp critical
        sum4 = _mm_add_ps(sum4_private, sum4);
    }
    __m128 t1 = _mm_hadd_ps(sum4, sum4);
    __m128 t2 = _mm_hadd_ps(t1, t1);
    float sum = _mm_cvtss_f32(t2);
    for(int i = ROUND_DOWN(N, 4); i < N; i++) {
        sum += a[i];
    }
    return sum;
}
However, this function does not perform as well as I hoped. I'm using Visual Studio 2012 Express. I know I can improve the performance a bit by unrolling the SSE loads/adds a few times, but that is still less than I expect.
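A sketch of what that unrolling might look like. This is not the original code: the four independent accumulators and the SSE2-only store-based horizontal sum are my own choices, and `_mm_loadu_ps` is used so the sketch also tolerates unaligned input.

```c
#include <immintrin.h>
#include <stddef.h>

#define ROUND_DOWN(x, s) ((x) & ~((s)-1))

/* Unrolled 4x: independent accumulators hide the latency of _mm_add_ps. */
static inline float sum_vector4x4(const float a[], const size_t N) {
    __m128 s0 = _mm_setzero_ps(), s1 = _mm_setzero_ps();
    __m128 s2 = _mm_setzero_ps(), s3 = _mm_setzero_ps();
    int i = 0;
    for (; i < ROUND_DOWN((int)N, 16); i += 16) {
        s0 = _mm_add_ps(s0, _mm_loadu_ps(a + i));
        s1 = _mm_add_ps(s1, _mm_loadu_ps(a + i + 4));
        s2 = _mm_add_ps(s2, _mm_loadu_ps(a + i + 8));
        s3 = _mm_add_ps(s3, _mm_loadu_ps(a + i + 12));
    }
    /* Combine the four partial vectors, then reduce horizontally (SSE2 only). */
    __m128 s = _mm_add_ps(_mm_add_ps(s0, s1), _mm_add_ps(s2, s3));
    float tmp[4];
    _mm_storeu_ps(tmp, s);
    float sum = tmp[0] + tmp[1] + tmp[2] + tmp[3];
    for (; i < (int)N; i++)   /* scalar cleanup for the remainder */
        sum += a[i];
    return sum;
}
```

Multiple accumulators matter because a single `sum4` chains every `_mm_add_ps` through the previous one, so the loop runs at the latency of the add rather than its throughput.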
I get much better performance by running over slices of the array equal to the number of threads:
inline float sum_slice(const float a[], const size_t N) {
    int nthreads = 4;
    const int offset = ROUND_DOWN(N/nthreads, nthreads);
    float suma[8] = {0};
    #pragma omp parallel for num_threads(nthreads)
    for(int i=0; i<nthreads; i++) {
        suma[i] = sum_vector4(&a[i*offset], offset);
    }
    float sum = 0.0f;
    for(int i=0; i<nthreads; i++) {
        sum += suma[i];
    }
    for(int i=nthreads*offset; i < N; i++) {
        sum += a[i];
    }
    return sum;
}
inline float sum_vector4(const float a[], const size_t N) {
    __m128 sum4 = _mm_set1_ps(0.0f);
    int i = 0;
    for(; i < ROUND_DOWN(N, 4); i+=4) {
        __m128 a4 = _mm_load_ps(a + i);
        sum4 = _mm_add_ps(sum4, a4);
    }
    __m128 t1 = _mm_hadd_ps(sum4, sum4);
    __m128 t2 = _mm_hadd_ps(t1, t1);
    float sum = _mm_cvtss_f32(t2);
    for(; i < N; i++) {
        sum += a[i];
    }
    return sum;
}
Does someone know if there is a better way of doing reductions with more complicated operators in OpenMP?
Answer

I guess the answer to your question is no: I don't think there is a better way of doing reductions with more complicated operators in OpenMP.
Assuming the array is 16-byte aligned and the number of OpenMP threads is 4, one might expect a 12x - 16x performance gain from OpenMP + SIMD. In reality, it might not produce that much gain because:
- There is overhead in creating the OpenMP threads.
- The code does one load and one add per iteration, so the CPU is not doing enough computation. It almost looks like the CPU spends most of its time loading data; the kernel is memory-bandwidth limited.