OpenMP atomic _mm_add_pd

Views: 244 Published: 2020/5/21 1:30:43 Tags: c++ openmp intrinsics

Problem description

I'm trying to use OpenMP to parallelize code that is already vectorized with intrinsics. The problem is that I use one XMM register as an outside 'variable' that I add to on every loop iteration. For now I'm using the shared clause:

__m128d xmm0 = _mm_setzero_pd();
__declspec(align(16)) double res[2];

#pragma omp parallel for shared(xmm0)
for (int i = 0; i < len; i++)
{
    __m128d xmm7 = ...; // result of some operations

    xmm0 = _mm_add_pd(xmm0, xmm7);
}

_mm_store_pd(res, xmm0);
double final_result = res[0] + res[1];

because the atomic operation is not supported here (in VS2010):

__m128d xmm0 = _mm_setzero_pd();
__declspec(align(16)) double res[2];

#pragma omp parallel for
for (int i = 0; i < len; i++)
{
    __m128d xmm7 = ...; // result of some operations

    #pragma omp atomic
    xmm0 = _mm_add_pd(xmm0, xmm7);
}

_mm_store_pd(res, xmm0);
double final_result = res[0] + res[1];

Does anyone know a clever workaround?

I've also just tried it using the Parallel Patterns Library:

__declspec(align(16)) double res[2];
combinable<__m128d> xmm0_comb([](){return _mm_setzero_pd();});

parallel_for(0, len, 1, [&xmm0_comb, ...](int i)
{
    __m128d xmm7 = ...; // result of some operations

    __m128d& xmm0 = xmm0_comb.local();
    xmm0 = _mm_add_pd(xmm0, xmm7);
});

__m128d xmm0 = xmm0_comb.combine([](__m128d a, __m128d b){return _mm_add_pd(a, b);});
_mm_store_pd(res, xmm0);
double final_result = res[0] + res[1];

But the result was disappointing.

Recommended answer

With great help from the people who answered my question, I've come up with this:

double final_result = 0.0;

#pragma omp parallel reduction(+:final_result)
{
    __declspec(align(16)) double r[2];
    __m128d xmm0 = _mm_setzero_pd();

    #pragma omp for
    for (int i = 0; i < len; i++)
    {
        __m128d xmm7 = ...; // result of some operations

        xmm0 = _mm_add_pd(xmm0, xmm7);
    }
    _mm_store_pd(r, xmm0);
    final_result += r[0] + r[1];
}

It basically separates the collapse of each thread's private XMM register into a scalar from the reduction across threads, and it performs very well.
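For anyone who wants to try the pattern, here is a minimal compilable sketch of the same idea. It is not the original code: the function name `sum_pairs`, the `data` array, and the use of a plain load as a stand-in for "result of some operations" are all assumptions made for illustration, and `alignas(16)` is used in place of `__declspec(align(16))` so the sketch is not tied to MSVC.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2: __m128d, _mm_add_pd, _mm_loadu_pd, _mm_store_pd

// Hypothetical example: sums data[0..len) two doubles at a time.
// The load below stands in for "result of some operations".
double sum_pairs(const double* data, int len)
{
    assert(len % 2 == 0);  // assumption: an even number of elements

    double final_result = 0.0;

    // Each thread gets a private copy of final_result (initialized to 0);
    // OpenMP adds the copies together when the parallel region ends.
    #pragma omp parallel reduction(+:final_result)
    {
        // Declared inside the region, so every thread has its own accumulator.
        __m128d acc = _mm_setzero_pd();

        #pragma omp for
        for (int i = 0; i < len / 2; i++)
        {
            __m128d v = _mm_loadu_pd(data + 2 * i);
            acc = _mm_add_pd(acc, v);  // no sharing, so no atomic is needed
        }

        // Collapse the two lanes of this thread's accumulator into a scalar.
        alignas(16) double r[2];
        _mm_store_pd(r, acc);
        final_result += r[0] + r[1];
    }

    return final_result;
}
```

Build with OpenMP enabled (for example `/openmp` on MSVC or `-fopenmp` on GCC/Clang). If OpenMP is off, the pragmas are ignored and the function simply runs serially. Note that the order in which partial sums are combined can differ between runs, so floating-point results may vary slightly for inputs that do not add exactly.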

Many thanks to all who have helped me!
