reduction with OpenMP with SSE/AVX

Question
I want to do a reduction on an array using OpenMP and SIMD. I read that a reduction in OpenMP is equivalent to:
inline float sum_scalar_openmp2(const float a[], const size_t N) {
    float sum = 0.0f;
    #pragma omp parallel
    {
        float sum_private = 0.0f;
        #pragma omp for nowait
        for(int i=0; i<N; i++) {
            sum_private += a[i];
        }
        #pragma omp atomic
        sum += sum_private;
    }
    return sum;
}
I got this idea from the following link: http://bisqwit.iki.fi/story/howto/openmp/#ReductionClause

But atomic also does not support complex operators. What I did was replace atomic with critical and implement the reduction with OpenMP and SSE like this:
#define ROUND_DOWN(x, s) ((x) & ~((s)-1))

inline float sum_vector4_openmp(const float a[], const size_t N) {
    __m128 sum4 = _mm_set1_ps(0.0f);
    #pragma omp parallel
    {
        __m128 sum4_private = _mm_set1_ps(0.0f);
        #pragma omp for nowait
        for(int i=0; i < ROUND_DOWN(N, 4); i+=4) {
            __m128 a4 = _mm_load_ps(a + i);
            sum4_private = _mm_add_ps(a4, sum4_private);
        }
        #pragma omp critical
        sum4 = _mm_add_ps(sum4_private, sum4);
    }
    __m128 t1 = _mm_hadd_ps(sum4, sum4);
    __m128 t2 = _mm_hadd_ps(t1, t1);
    float sum = _mm_cvtss_f32(t2);
    for(int i = ROUND_DOWN(N, 4); i < N; i++) {
        sum += a[i];
    }
    return sum;
}
However, this function does not perform as well as I hoped. I'm using Visual Studio 2012 Express. I know I can improve the performance a bit by unrolling the SSE loads/adds a few times, but that is still less than I expect.
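A sketch of what that unrolling might look like. This is not the original code: the four independent accumulators and the SSE2-only store-based horizontal sum are my own choices, and `_mm_loadu_ps` is used so the sketch also tolerates unaligned input.

```c
#include <immintrin.h>
#include <stddef.h>

#define ROUND_DOWN(x, s) ((x) & ~((s)-1))

/* Unrolled 4x: independent accumulators hide the latency of _mm_add_ps. */
static inline float sum_vector4x4(const float a[], const size_t N) {
    __m128 s0 = _mm_setzero_ps(), s1 = _mm_setzero_ps();
    __m128 s2 = _mm_setzero_ps(), s3 = _mm_setzero_ps();
    int i = 0;
    for (; i < ROUND_DOWN((int)N, 16); i += 16) {
        s0 = _mm_add_ps(s0, _mm_loadu_ps(a + i));
        s1 = _mm_add_ps(s1, _mm_loadu_ps(a + i + 4));
        s2 = _mm_add_ps(s2, _mm_loadu_ps(a + i + 8));
        s3 = _mm_add_ps(s3, _mm_loadu_ps(a + i + 12));
    }
    /* Combine the four partial vectors, then reduce horizontally (SSE2 only). */
    __m128 s = _mm_add_ps(_mm_add_ps(s0, s1), _mm_add_ps(s2, s3));
    float tmp[4];
    _mm_storeu_ps(tmp, s);
    float sum = tmp[0] + tmp[1] + tmp[2] + tmp[3];
    for (; i < (int)N; i++)   /* scalar cleanup for the remainder */
        sum += a[i];
    return sum;
}
```

Multiple accumulators matter because a single `sum4` chains every `_mm_add_ps` through the previous one, so the loop runs at the latency of the add rather than its throughput.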
I get much better performance by running over slices of the array equal to the number of threads:
inline float sum_slice(const float a[], const size_t N) {
    int nthreads = 4;
    const int offset = ROUND_DOWN(N/nthreads, nthreads);
    float suma[8] = {0};
    #pragma omp parallel for num_threads(nthreads)
    for(int i=0; i<nthreads; i++) {
        suma[i] = sum_vector4(&a[i*offset], offset);
    }
    float sum = 0.0f;
    for(int i=0; i<nthreads; i++) {
        sum += suma[i];
    }
    for(int i=nthreads*offset; i < N; i++) {
        sum += a[i];
    }
    return sum;
}
inline float sum_vector4(const float a[], const size_t N) {
    __m128 sum4 = _mm_set1_ps(0.0f);
    int i = 0;
    for(; i < ROUND_DOWN(N, 4); i+=4) {
        __m128 a4 = _mm_load_ps(a + i);
        sum4 = _mm_add_ps(sum4, a4);
    }
    __m128 t1 = _mm_hadd_ps(sum4, sum4);
    __m128 t2 = _mm_hadd_ps(t1, t1);
    float sum = _mm_cvtss_f32(t2);
    for(; i < N; i++) {
        sum += a[i];
    }
    return sum;
}
Does someone know if there is a better way of doing reductions with more complicated operators in OpenMP?
Answer

I guess the answer to your question is no: I don't think there is a better way of doing reductions with more complicated operators in OpenMP.
Assuming the array is 16-byte aligned and the number of OpenMP threads is 4, one might expect a 12x - 16x performance gain from OpenMP + SIMD. In reality, it might not produce that much gain because:
- There is overhead in creating the OpenMP threads.
- The code does one load and one add per iteration, so the CPU is not doing enough computation. It almost looks like the CPU spends most of its time loading data; the kernel is memory-bandwidth limited.