OpenMP atomic _mm_add_pd
Question
I'm trying to use OpenMP to parallelize code that is already vectorized with intrinsics, but the problem is that I'm using one XMM register as an 'outside' variable that I increment on every loop iteration. For now I'm using the shared clause:
__m128d xmm0 = _mm_setzero_pd();
__declspec(align(16)) double res[2];
#pragma omp parallel for shared(xmm0)
for (int i = 0; i < len; i++)
{
__m128d xmm7 = ... result of some operations
xmm0 = _mm_add_pd(xmm0, xmm7);
}
_mm_store_pd(res, xmm0);
double final_result = res[0] + res[1];
because the atomic operation is not supported (in VS2010):
__m128d xmm0 = _mm_setzero_pd();
__declspec(align(16)) double res[2];
#pragma omp parallel for
for (int i = 0; i < len; i++)
{
__m128d xmm7 = ... result of some operations
#pragma omp atomic
xmm0 = _mm_add_pd(xmm0, xmm7);
}
_mm_store_pd(res, xmm0);
double final_result = res[0] + res[1];
Does anyone know a clever work-around?
I've also tried it using the Parallel Patterns Library just now:
__declspec(align(16)) double res[2];
combinable<__m128d> xmm0_comb([](){return _mm_setzero_pd();});
parallel_for(0, len, 1, [&xmm0_comb, ...](int i)
{
__m128d xmm7 = ... result of some operations
__m128d& xmm0 = xmm0_comb.local();
xmm0 = _mm_add_pd(xmm0, xmm7);
});
__m128d xmm0 = xmm0_comb.combine([](__m128d a, __m128d b){return _mm_add_pd(a, b);});
_mm_store_pd(res, xmm0);
double final_result = res[0] + res[1];
But the results were disappointing.
Answer
With great help from the people who answered my question I've come up with this:
double final_result = 0.0;
#pragma omp parallel reduction(+:final_result)
{
__declspec(align(16)) double r[2];
__m128d xmm0 = _mm_setzero_pd();
#pragma omp for
for (int i = 0; i < len; i++)
{
__m128d xmm7 = ... result of some operations
xmm0 = _mm_add_pd(xmm0, xmm7);
}
_mm_store_pd(r, xmm0);
final_result += r[0] + r[1];
}
It basically separates the per-thread vector accumulation from the final scalar reduction, and it performs very well.
Many thanks to all who have helped me!