Unexpectedly good performance with openmp parallel for loop


Problem description


I have edited my question after previous comments (especially @Zboson) for better readability

I have always acted on, and observed, the conventional wisdom that the number of openmp threads should roughly match the number of hyper-threads on a machine for optimal performance. However, I am observing odd behaviour on my new laptop with Intel Core i7 4960HQ, 4 cores - 8 threads. (See Intel docs here)
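
For reference, the logical-processor count that the OpenMP runtime sees, and the thread limit it will actually use, can be queried directly; a minimal sketch (just a side check, separate from the test code below):

#include <stdio.h>
#include <omp.h>

int main() {
    /* Logical processors (hardware threads) visible to the OpenMP runtime. */
    printf("omp_get_num_procs():   %d\n", omp_get_num_procs());
    /* Upper bound on threads a parallel region will use; honours OMP_NUM_THREADS. */
    printf("omp_get_max_threads(): %d\n", omp_get_max_threads());
    return 0;
}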

Here is my test code:

#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main() {
    const int n = 256*8192*100;
    double *A, *B;
    posix_memalign((void**)&A, 64, n*sizeof(double));
    posix_memalign((void**)&B, 64, n*sizeof(double));
    for (int i = 0; i < n; ++i) {
        A[i] = 0.1;
        B[i] = 0.0;
    }
    double start = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        B[i] = exp(A[i]) + sin(B[i]);
    }
    double end = omp_get_wtime();
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += B[i];
    }
    printf("%g %g\n", end - start, sum);
    return 0;
}

When I compile it using gcc 4.9-4.9-20140209, with the command: gcc -Ofast -march=native -std=c99 -fopenmp -Wa,-q I see the following performance as I change OMP_NUM_THREADS [the points are an average of 5 runs, the error bars (which are hardly visible) are the standard deviations]:

The plot is clearer when shown as the speed up with respect to OMP_NUM_THREADS=1:

The performance more or less monotonically increases with the thread number, even when the number of OpenMP threads greatly exceeds the core and hyper-thread count! Usually the performance should drop off when too many threads are used (at least in my previous experience) due to the threading overhead, especially as the calculation should be CPU (or at least memory) bound and not waiting on I/O.

Even more weirdly, the speed-up is 35 times!

Can anyone explain this?

I also tested this with much smaller arrays (8192*4 elements) and see similar performance scaling.

In case it matters, I am on Mac OS 10.9 and the performance data were obtained by running (under bash):

for i in {1..128}; do
    for k in {1..5}; do
        export OMP_NUM_THREADS=$i;
        echo -ne $i $k "";
        ./a.out;
    done;
done > out

EDIT: Out of curiosity I decided to try much larger numbers of threads. My OS limits this to 2000. The odd results (both speed up and low thread overhead) speak for themselves!

EDIT: I tried @Zboson's latest suggestion in their answer, i.e. putting VZEROUPPER before each math function within the loop, and it did fix the scaling problem! (It also brought the single-threaded code down from 22 s to 2 s!):

Solution

The problem is likely due to the clock() function. It does not return the wall time on Linux. You should use the function omp_get_wtime(). It's more accurate than clock and works on GCC, ICC, and MSVC. In fact I use it for timing code even when I'm not using OpenMP.

I tested your code with it here http://coliru.stacked-crooked.com/a/26f4e8c9fdae5cc2
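
To see the difference concretely, here is a minimal sketch (assuming Linux/glibc, where clock() accumulates CPU time across all threads): the clock()-based figure grows with the thread count, while omp_get_wtime() reports the elapsed wall time.

#include <stdio.h>
#include <time.h>
#include <omp.h>

int main() {
    clock_t c0 = clock();
    double  w0 = omp_get_wtime();

    /* The work below is spread over all OpenMP threads. */
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (long i = 0; i < 400000000L; ++i)
        s += i * 1e-9;

    double cpu  = (double)(clock() - c0) / CLOCKS_PER_SEC;  /* CPU time summed over threads */
    double wall = omp_get_wtime() - w0;                     /* elapsed wall-clock time */
    printf("sum = %g, clock(): %.2f s, omp_get_wtime(): %.2f s\n", s, cpu, wall);
    return 0;
}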

Edit: Another thing to consider which may be causing your problem is that the exp and sin functions you are using are compiled WITHOUT AVX support, while your code is compiled with AVX support (actually AVX2). You can see this on GCC explorer with your code if you compile with -fopenmp -mavx2 -mfma. Whenever you call a function without AVX support from code with AVX, you need to zero the upper part of the YMM registers or pay a large penalty. You can do this with the intrinsic _mm256_zeroupper (VZEROUPPER). Clang does this for you, but last I checked GCC does not, so you have to do it yourself (see the comments to this question Math functions takes more cycles after running any intel AVX function and also the answer here Using AVX CPU instructions: Poor performance without "/arch:AVX"). So every iteration you have a large delay due to not calling VZEROUPPER. I'm not sure why this matters with multiple threads, but if GCC does this each time it starts a new thread then it could help explain what you are seeing.

#include <immintrin.h>

#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    _mm256_zeroupper();
    B[i] = sin(B[i]);
    _mm256_zeroupper();
    B[i] += exp(A[i]);       
}

Edit: A simpler way to test this is, instead of compiling with -march=native, to not set the arch (gcc -Ofast -std=c99 -fopenmp -Wa) or to just use SSE2 (gcc -Ofast -msse2 -std=c99 -fopenmp -Wa).

Edit: GCC 4.8 has an option -mvzeroupper, which may be the most convenient solution.

This option instructs GCC to emit a vzeroupper instruction before a transfer of control flow out of the function to minimize the AVX to SSE transition penalty as well as remove unnecessary zeroupper intrinsics.
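
For example, one way to try it is to add that flag to the compile command from the question:

gcc -Ofast -march=native -mvzeroupper -std=c99 -fopenmp -Wa,-q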
