Performance issues of #pragma omp atomic with OMP_NUM_THREADS=1

Question

I have observed unexpected (for me!) behavior of an OpenMP code which I am writing. The code structure is the following:

#pragma omp parallel for
for(int i=0;i<N;i++){
    // lots of calculations that produce 3 integers i1,i2,i3 and 3 doubles d1,d2,d3
    #pragma omp atomic
    J1[i1] += d1;
    #pragma omp atomic
    J2[i2] += d2;
    #pragma omp atomic
    J3[i3] += d3;
}

I have compiled three different versions of this code:

1) with OpenMP (-fopenmp)

2) without OpenMP

3) with OpenMP, but without the 3 atomic operations (just as a test, since the atomic operations are necessary)

When I run version 1) with the environment variable OMP_NUM_THREADS=1, I observe a significant slowdown with respect to version 2), while version 3) runs as fast as version 2).

I would like to know the reason for this behavior (why do atomic operations slow the code down even when single-threaded?!) and whether it is possible to compile/rewrite the code in such a way that version 1) runs as fast as version 2).

I attach at the end of the question a working example which shows the aforementioned behavior. I compiled 1) with:

g++ -fopenmp -o toy_code toy_code.cpp -std=c++11 -O3

2) with:

g++ -o toy_code_NO_OMP toy_code.cpp -std=c++11 -O3

and 3) with:

g++ -fopenmp -o toy_code_NO_ATOMIC toy_code_NO_ATOMIC.cpp -std=c++11 -O3

The version of the compiler is gcc version 5.3.1 20160519 (Debian 5.3.1-20). The execution times of the 3 versions are:

1) 1 min 24 sec

2) 51 sec

3) 51 sec

Thanks in advance for your advice!

// toy_code.cpp
#include <stdio.h>
#include <iostream>
#include <stdlib.h>
#include <cmath>
#include <omp.h>
#define Np 1000000
#define N 1000

int main (){
    double *Xp, *Yp, *J, *Jb;
    Xp = new double[Np];
    Yp = new double[Np];
    J  = new double[N*N];
    Jb = new double[N*N];

    for(int i=0;i<N*N;i++){
        J[i]  = 0.0;
        Jb[i] = 0.0;
    }

    // scatter Np particles uniformly over [-0.5, 0.5) x [-0.5, 0.5)
    for(int i=0;i<Np;i++){
        Xp[i] = rand()*1.0/RAND_MAX - 0.5;
        Yp[i] = rand()*1.0/RAND_MAX - 0.5;
    }

    for(int n=0; n<2000; n++){
        #pragma omp parallel for
        for(int p=0;p<Np;p++){
            // map the particle position to the nearest grid cell
            double rx = (Xp[p]+0.5)*(N-1);
            double ry = (Yp[p]+0.5)*(N-1);
            int xindex = (int)floor(rx+0.5);
            int yindex = (int)floor(ry+0.5);
            int k = xindex*N + yindex;

            // different iterations may hit the same cell k, hence the atomics
            #pragma omp atomic
            J[k] += 1;
            #pragma omp atomic
            Jb[k] += 1;
        }
    }

    delete[] Xp;
    delete[] Yp;
    delete[] J;
    delete[] Jb;

    return 0;
}

Answer

If you enable OpenMP, gcc has to generate different code that works for any number of threads, which is only known at runtime.
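
To see that the thread count really is a runtime property, a minimal check (my addition, not part of the original answer) is:

#include <omp.h>
#include <cstdio>

int main(){
    // The same binary prints different values depending on the
    // OMP_NUM_THREADS environment variable, so the compiler cannot
    // specialize the atomics away for the single-threaded case.
    std::printf("max threads: %d\n", omp_get_max_threads());
    return 0;
}

Compiled with g++ -fopenmp, this prints 1 under OMP_NUM_THREADS=1 and, e.g., 4 under OMP_NUM_THREADS=4, without recompilation.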

In this particular case, take a look at the output of gcc -S (slightly shortened by removing labels).

Without OpenMP:

.loc 1 38 0 discriminator 2  # Line 38 is J[k]+=1;
movsd   8(%rsp), %xmm1
cvttsd2si   %xmm0, %edx
cvttsd2si   %xmm1, %eax
movsd   .LC3(%rip), %xmm0
imull   $1000, %eax, %eax
addl    %edx, %eax
cltq
salq    $3, %rax
leaq    0(%r13,%rax), %rdx
.loc 1 40 0 discriminator 2   # Line 40 is Jb[k]+=1;
addq    %r12, %rax
.loc 1 29 0 discriminator 2
cmpq    $8000000, %r15
.loc 1 38 0 discriminator 2
addsd   (%rdx), %xmm0
movsd   %xmm0, (%rdx)
.loc 1 40 0 discriminator 2
movsd   .LC3(%rip), %xmm0
addsd   (%rax), %xmm0
movsd   %xmm0, (%rax)

The loop is unrolled, making this rather complicated.

With -fopenmp:

movsd   (%rsp), %xmm2
cvttsd2si   %xmm0, %eax
cvttsd2si   %xmm2, %ecx
imull   $1000, %ecx, %ecx
addl    %eax, %ecx
movslq  %ecx, %rcx
salq    $3, %rcx
movq    %rcx, %rsi
addq    16(%rbp), %rsi
movq    (%rsi), %rdx
movsd   8(%rsp), %xmm1
jmp .L4
movq    %rax, %rdx
movq    %rdx, (%rsp)
movq    %rdx, %rax
movsd   (%rsp), %xmm3
addsd   %xmm1, %xmm3
movq    %xmm3, %rdi
lock cmpxchgq   %rdi, (%rsi)
cmpq    %rax, %rdx
jne .L9
.loc 1 40 0
addq    24(%rbp), %rcx
movq    (%rcx), %rdx
jmp .L5
.p2align 4,,10
.p2align 3
movq    %rax, %rdx
movq    %rdx, (%rsp)
movq    %rdx, %rax
movsd   (%rsp), %xmm4
addsd   %xmm1, %xmm4
movq    %xmm4, %rsi
lock cmpxchgq   %rsi, (%rcx)
cmpq    %rax, %rdx
jne .L10
addq    $8, %r12
cmpq    %r12, %rbx
jne .L6

I'm not going to try to explain or understand all the details of what is happening here, but that's not necessary for the message: the compiler has to use different atomic instructions that are likely more costly, especially lock cmpxchgq.
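
What that lock cmpxchgq loop does is a compare-and-swap retry: read the old value, add to it, and try to swap the result back, repeating if another thread modified the location in between. A minimal C++11 sketch of the same pattern using std::atomic (my illustration of the idea, not what gcc literally emits):

#include <atomic>

// The moral equivalent of "#pragma omp atomic" on a double when the
// target machine has no atomic floating-point add: a CAS retry loop.
void atomic_add(std::atomic<double>& target, double value){
    double expected = target.load();
    // compare_exchange_weak updates 'expected' with the current value
    // on failure, so each retry adds to a fresh snapshot.
    while(!target.compare_exchange_weak(expected, expected + value)){
        // retry until no other thread modified target in between
    }
}

Even with a single thread, every += now costs a locked read-modify-write instead of a plain load/add/store, which is where the slowdown comes from.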

Besides this fundamental issue, OpenMP may mess with the optimizer in any imaginable way, e.g. interfere with unrolling. I've also seen a curious case where the Intel compiler actually generated more efficient serial code for an OpenMP loop.

P.S. Consider yourself lucky - it could be much worse. If the compiler cannot map the atomic operation to a hardware instruction, it has to fall back to locks, which would be even slower.
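
As for the second half of the question (making version 1 run as fast as version 2), one common rewrite, sketched below as my own suggestion rather than part of the original answer, is to give each thread a private copy of the grid and merge the copies afterwards, so the hot loop needs no atomics at all. It trades memory (one N*N grid per thread) for speed:

#include <omp.h>
#include <vector>
#include <cmath>
#define Np 1000000
#define N 1000

// Deposit all particles into J without atomics: each thread accumulates
// into its own private grid, then the grids are summed in a second pass.
void deposit(const double* Xp, const double* Yp, double* J){
    const int nth = omp_get_max_threads();
    std::vector<double> Jlocal((size_t)nth * N * N, 0.0);

    #pragma omp parallel
    {
        double* mine = &Jlocal[(size_t)omp_get_thread_num() * N * N];
        #pragma omp for
        for(int p=0;p<Np;p++){
            int xindex = (int)floor((Xp[p]+0.5)*(N-1)+0.5);
            int yindex = (int)floor((Yp[p]+0.5)*(N-1)+0.5);
            mine[xindex*N+yindex] += 1.0;   // private buffer: no atomic needed
        }
        // reduce the per-thread grids into the shared one; each thread
        // owns a disjoint range of cells i, so this pass is race-free too
        #pragma omp for
        for(int i=0;i<N*N;i++)
            for(int t=0;t<nth;t++)
                J[i] += Jlocal[(size_t)t*N*N + i];
    }
}

With one thread this collapses to a plain serial accumulation plus one copy pass, so it should run close to version 2); whether the extra memory and the reduction pass pay off at higher thread counts depends on N and on how often particles collide in the same cell.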
