OpenMP atomic _mm_add_pd

Problem description

I'm trying to use OpenMP to parallelize code that is already vectorized with intrinsics. The problem is that I use one XMM register as an outside 'variable' that I increment in each loop iteration. For now I'm using the shared clause:

__m128d xmm0 = _mm_setzero_pd();
__declspec(align(16)) double res[2];

#pragma omp parallel for shared(xmm0)
for (int i = 0; i < len; i++)
{
    __m128d xmm7 = ... result of some operations

    xmm0 = _mm_add_pd(xmm0, xmm7);
}

_mm_store_pd(res, xmm0);
double final_result = res[0] + res[1];

because the atomic operation is not supported for this statement (in VS2010):

__m128d xmm0 = _mm_setzero_pd();
__declspec(align(16)) double res[2];

#pragma omp parallel for
for (int i = 0; i < len; i++)
{
    __m128d xmm7 = ... result of some operations

    #pragma omp atomic
    xmm0 = _mm_add_pd(xmm0, xmm7);
}

_mm_store_pd(res, xmm0);
double final_result = res[0] + res[1];
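
For context, #pragma omp atomic only applies to scalar update statements such as x += expr, so an assignment through _mm_add_pd does not qualify. The following is a minimal sketch of a form that atomic does accept, not the asker's code: each iteration folds its register into a scalar first and then updates a shared double atomically. The helper load_pair is a hypothetical stand-in for "result of some operations", and alignas(16) is assumed as a portable C++11 replacement for __declspec(align(16)) (VS2010 itself does not support alignas).

```cpp
#include <emmintrin.h>  // SSE2: __m128d, _mm_loadu_pd, _mm_store_pd

// Hypothetical stand-in for "... result of some operations":
// loads the pair (data[2*i], data[2*i + 1]) into one register.
static inline __m128d load_pair(const double* data, int i)
{
    return _mm_loadu_pd(data + 2 * i);
}

// Sums 2 * len doubles; every iteration does one atomic scalar update.
double sum_pairs_atomic(const double* data, int len)
{
    double total = 0.0;

    #pragma omp parallel for
    for (int i = 0; i < len; i++)
    {
        __m128d xmm7 = load_pair(data, i);

        // Fold the two lanes into one scalar before touching shared state.
        alignas(16) double r[2];
        _mm_store_pd(r, xmm7);
        double lane_sum = r[0] + r[1];

        // A scalar "x += expr" is a statement form that atomic accepts.
        #pragma omp atomic
        total += lane_sum;
    }
    return total;
}
```

Calling sum_pairs_atomic(v, 4) on an array v of 8 doubles adds all 8 values. Because every iteration contends for the same variable, this is likely to scale poorly; it is shown only to illustrate what atomic can express.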

Does anyone know a clever workaround?

I've also just tried it using the Parallel Patterns Library:

__declspec(align(16)) double res[2];
combinable<__m128d> xmm0_comb([](){return _mm_setzero_pd();});

parallel_for(0, len, 1, [&xmm0_comb, ...](int i)
{
    __m128d xmm7 = ... result of some operations

    __m128d& xmm0 = xmm0_comb.local();
    xmm0 = _mm_add_pd(xmm0, xmm7);
});

__m128d xmm0 = xmm0_comb.combine([](__m128d a, __m128d b){return _mm_add_pd(a, b);});
_mm_store_pd(res, xmm0);
double final_result = res[0] + res[1];

But the results were disappointing.

Recommended answer

With great help from the people who answered my question, I've come up with this:

double final_result = 0.0;

#pragma omp parallel reduction(+:final_result)
{
    __declspec(align(16)) double r[2];
    __m128d xmm0 = _mm_setzero_pd();

    #pragma omp for
    for (int i = 0; i < len; i++)
    {
        __m128d xmm7 = ... result of some operations

        xmm0 = _mm_add_pd(xmm0, xmm7);
    }
    _mm_store_pd(r, xmm0);
    final_result += r[0] + r[1];
}

It basically separates the collapse (each thread folding its own XMM accumulator into a scalar) from the reduction (OpenMP summing those scalars across threads), and it performs very well.
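
The pattern above can be written as a self-contained function. This is a sketch under stated assumptions rather than the original code: some_operations is a hypothetical stand-in for the elided "result of some operations" (here it just loads two adjacent doubles), and alignas(16) is assumed in place of the MSVC-specific __declspec(align(16)).

```cpp
#include <emmintrin.h>  // SSE2: __m128d, _mm_add_pd, _mm_store_pd

// Hypothetical stand-in for "... result of some operations":
// loads the pair (data[2*i], data[2*i + 1]) into one register.
static inline __m128d some_operations(const double* data, int i)
{
    return _mm_loadu_pd(data + 2 * i);
}

// Sums 2 * len doubles using a private vector accumulator per thread.
double sum_pairs(const double* data, int len)
{
    double final_result = 0.0;

    #pragma omp parallel reduction(+:final_result)
    {
        // Declared inside the parallel region, so each thread has its own.
        __m128d xmm0 = _mm_setzero_pd();

        #pragma omp for
        for (int i = 0; i < len; i++)
        {
            xmm0 = _mm_add_pd(xmm0, some_operations(data, i));
        }

        // Collapse: fold this thread's two lanes into one scalar.
        alignas(16) double r[2];
        _mm_store_pd(r, xmm0);

        // Reduction: OpenMP combines the per-thread scalars at region exit.
        final_result += r[0] + r[1];
    }
    return final_result;
}
```

Calling sum_pairs(v, 4) on an array v of 8 doubles adds all 8 values. The loop body touches no shared state, so no synchronization is needed until the single scalar reduction at the end of the parallel region. If OpenMP is not enabled at compile time, the pragmas are ignored and the function runs serially with the same result.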

Many thanks to all who have helped me!
