Unexpectedly good performance with openmp parallel for loop

Question

I have edited my question following the earlier comments (notably from @Zboson) to improve readability.

I have always acted on, and observed, the conventional wisdom that the number of openmp threads should roughly match the number of hyper-threads on a machine for optimal performance. However, I am observing odd behaviour on my new laptop with Intel Core i7 4960HQ, 4 cores - 8 threads. (See Intel docs here)

Here is my test code:

#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main() {
    const int n = 256*8192*100;
    double *A, *B;
    posix_memalign((void**)&A, 64, n*sizeof(double));
    posix_memalign((void**)&B, 64, n*sizeof(double));
    for (int i = 0; i < n; ++i) {
        A[i] = 0.1;
        B[i] = 0.0;
    }
    double start = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        B[i] = exp(A[i]) + sin(B[i]);
    }
    double end = omp_get_wtime();
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += B[i];
    }
    printf("%g %g
", end - start, sum);
    return 0;
}

When I compile it using gcc 4.9-20140209, with the command gcc -Ofast -march=native -std=c99 -fopenmp -Wa,-q, I see the following performance as I change OMP_NUM_THREADS [the points are an average of 5 runs, the error bars (which are hardly visible) are the standard deviations]:

The plot is clearer when shown as the speed up with respect to OMP_NUM_THREADS=1:

The performance more or less monotonically increases with thread number, even when the number of omp threads far exceeds the core count and even the hyper-thread count! Usually the performance should drop off when too many threads are used (at least in my previous experience), due to the threading overhead, especially as the calculation should be CPU (or at least memory) bound and not waiting on I/O.

Even more weirdly, the speed-up is 35 times!

Can anyone explain this?

I also tested this with a much smaller array, 8192*4, and see similar performance scaling.

In case it matters, I am on Mac OS 10.9 and the performance data were obtained by running (under bash):

for i in {1..128}; do
    for k in {1..5}; do
        export OMP_NUM_THREADS=$i;
        echo -ne $i $k "";
        ./a.out;
    done;
done > out

Out of curiosity I decided to try much larger numbers of threads. My OS limits this to 2000. The odd results (both speed up and low thread overhead) speak for themselves!

I tried @Zboson's latest suggestion in their answer, i.e. putting VZEROUPPER before each math function within the loop, and it did fix the scaling problem! (It also took the single-threaded code from 22 s to 2 s!)

Answer

The problem is likely due to the clock() function. It does not return the wall time on Linux. You should use the function omp_get_wtime(). It's more accurate than clock and works on GCC, ICC, and MSVC. In fact I use it for timing code even when I'm not using OpenMP.

I tested your code with it here http://coliru.stacked-crooked.com/a/26f4e8c9fdae5cc2
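
As a quick illustration (a minimal sketch, not from the original post), the distinction matters because on Linux/glibc clock() returns CPU time accumulated across all threads of the process, not elapsed time, whereas omp_get_wtime() returns wall time. A toy comparison:

#include <math.h>
#include <stdio.h>
#include <time.h>
#include <omp.h>

int main() {
    const int n = 1 << 24;
    double sum = 0.0;
    clock_t c0 = clock();              /* CPU time, summed over all threads */
    double w0 = omp_get_wtime();       /* wall time */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += sin(0.001 * i);         /* arbitrary busy work */
    }
    clock_t c1 = clock();
    double w1 = omp_get_wtime();
    printf("clock():         %g s\n", (double)(c1 - c0) / CLOCKS_PER_SEC);
    printf("omp_get_wtime(): %g s\n", w1 - w0);
    printf("checksum: %g\n", sum);
    return 0;
}

With more threads, the omp_get_wtime() figure drops while the clock() figure stays roughly constant (or grows slightly with overhead), so timing a parallel region with clock() hides the real speed-up.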

Edit: Another thing to consider which may be causing your problem is that the exp and sin functions you are using are compiled WITHOUT AVX support, while your code is compiled with AVX support (actually AVX2). You can see this from GCC explorer with your code if you compile with -fopenmp -mavx2 -mfma. Whenever you call a function without AVX support from code with AVX, you need to zero the upper part of the YMM registers or pay a large penalty. You can do this with the intrinsic _mm256_zeroupper (VZEROUPPER). Clang does this for you, but last I checked GCC does not, so you have to do it yourself (see the comments to this question: Math functions takes more cycles after running any intel AVX function, and also the answer here: Using AVX CPU instructions: Poor performance without "/arch:AVX"). So every iteration you have a large delay due to not calling VZEROUPPER. I'm not sure why this matters with multiple threads, but if GCC does this each time it starts a new thread then it could help explain what you are seeing.

#include <immintrin.h>

#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    _mm256_zeroupper();   // zero the upper halves of the YMM registers
    B[i] = sin(B[i]);     // before each call into the non-AVX libm routines
    _mm256_zeroupper();
    B[i] += exp(A[i]);
}

Edit: A simpler way to test this is to compile without setting the arch (gcc -Ofast -std=c99 -fopenmp -Wa) instead of using -march=native, or to use only SSE2 (gcc -Ofast -msse2 -std=c99 -fopenmp -Wa).

Edit: GCC 4.8 has an option, -mvzeroupper, which may be the most convenient solution.

This option instructs GCC to emit a vzeroupper instruction before a transfer of control flow out of the function to minimize the AVX to SSE transition penalty as well as remove unnecessary zeroupper intrinsics.
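
For example (a sketch of the idea, not verified here), the flag can simply be added to the compile command from the question:

gcc -Ofast -march=native -mvzeroupper -std=c99 -fopenmp -Wa,-q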
