Slower code with OpenMP: how can it be parallelized?

Posted: 2020/5/21 1:25:34 | Tags: c++ performance parallel-processing openmp icc


Question

This code is slower with OpenMP. Without OpenMP it takes about 10 s; with OpenMP it takes about 40 s. What is happening? Thank you very much!

for (i=2;i<(nnoib-2);++i){
    #pragma omp parallel for
    for (j=2; j<(nnojb-2); ++j) {
        C[i][j]= absi[i]*absj[j]*
                 (2.0f*B[i][j] + absi[i]*absj[j]*
                 (VEL[i][j]*VEL[i][j]*fat*
                 (16.0f*(B[i][j-1]+B[i][j+1]+B[i-1][j]+B[i+1][j])
                 -1.0f*(B[i][j-2]+B[i][j+2]+B[i-2][j]+B[i+2][j]) 
                 -60.0f*B[i][j]
                 )-A[i][j]));
        c2 = (abs(C[i][j]) > Amax[i][j]);
        if (c2) {
            Amax[i][j] = abs(C[i][j]);
            Ttra[i][j] = t;
        }
    }
}

Recommended answer

Using OpenMP does not by itself make a program run faster. Several things could be happening here:

Spawning a thread has a cost. If a thread is spawned to do only a small amount of computation, spawning it can take longer than the computation itself. In the code above the parallel region sits inside the i loop, so that overhead is paid once per iteration of the outer loop.
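One way to pay the region start-up cost once per sweep rather than once per row is to put the `parallel for` on the outer loop. The following is only a sketch under stated assumptions: `Grid` and `stencil` are hypothetical names, the grid is stored row-major in a `std::vector<float>`, and the kernel is a simplified stand-in for the one in the question (the `absi`, `absj`, `VEL`, `fat` and `A` terms are omitted).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: a rows x cols grid stored row-major in one vector.
struct Grid {
    int rows, cols;
    std::vector<float> data;
    Grid(int r, int c, float v = 0.0f)
        : rows(r), cols(c), data(static_cast<std::size_t>(r) * c, v) {}
    float& at(int i, int j) { return data[static_cast<std::size_t>(i) * cols + j]; }
    float at(int i, int j) const { return data[static_cast<std::size_t>(i) * cols + j]; }
};

// Simplified stand-in for the question's kernel (assumption: only the
// neighbour-weighting part of the expression is kept).
void stencil(const Grid& in, Grid& out) {
    assert(in.rows == out.rows && in.cols == out.cols);
    const int ni = in.rows - 2;
    const int nj = in.cols - 2;
    // One parallel region for the whole sweep: rows are divided among threads.
    // If OpenMP is not enabled, the pragma is ignored and the loop runs serially.
    #pragma omp parallel for
    for (int i = 2; i < ni; ++i) {
        // j is declared in the loop header, so each thread has its own copy.
        for (int j = 2; j < nj; ++j) {
            out.at(i, j) =
                16.0f * (in.at(i, j - 1) + in.at(i, j + 1) + in.at(i - 1, j) + in.at(i + 1, j))
                - (in.at(i, j - 2) + in.at(i, j + 2) + in.at(i - 2, j) + in.at(i + 2, j))
                - 60.0f * in.at(i, j);
        }
    }
}
```

Each thread writes only to its own rows of `out` and only reads from `in`, so the rows can be processed independently. Whether this is actually faster depends on the grid size and the machine, so it is worth timing both variants.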

By default, OpenMP typically spawns as many threads as the CPU has hardware threads. On CPUs that support two or more threads per core, those threads compete for each core's resources. Calling omp_get_max_threads() shows how many threads a parallel region will use by default (omp_get_num_threads() returns 1 when called outside a parallel region). I recommend trying your code with half that value, set with omp_set_num_threads().
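A minimal sketch of that experiment follows. The helper names `default_thread_count` and `use_half_the_threads` are hypothetical; the OpenMP calls are `omp_get_max_threads()` and `omp_set_num_threads()`. Note that `omp_get_num_threads()` reports the size of the current team, which is 1 outside a parallel region, so the sketch queries `omp_get_max_threads()` instead. The `_OPENMP` guards let the code compile when OpenMP is not enabled.

```cpp
#include <algorithm>
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Number of threads the next parallel region would use by default
// (1 when the program is built without OpenMP).
int default_thread_count() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

// Request half the default thread count, never fewer than one thread.
// Returns the value that was requested.
int use_half_the_threads() {
    const int half = std::max(1, default_thread_count() / 2);
    assert(half >= 1);
#ifdef _OPENMP
    omp_set_num_threads(half);
#endif
    return half;
}
```

The same experiment can be run without recompiling by setting the `OMP_NUM_THREADS` environment variable before launching the program.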

Did you confirm that the results are the same with and without OpenMP? The variable c2, if it is declared outside the loop, is shared between threads, which is a data race. The loop variable j is made private automatically by the parallel for construct, but listing it explicitly does no harm. Declare them private to each thread:

#pragma omp parallel for private(j,c2)
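An alternative to the `private` clause is to declare temporaries inside the loop body, which makes them private to each thread automatically. The sketch below does that and also makes it easy to compare a serial run against a parallel one. It is an illustration only: `update_max` is a hypothetical name, and the one-dimensional arrays are a stand-in for the `Amax`/`Ttra` update in the question.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-in for the Amax/Ttra update: remember the largest
// |c[k]| seen so far and the time step t at which it occurred.
// The OpenMP if() clause runs the loop serially when `parallel` is false.
void update_max(const std::vector<float>& c, std::vector<float>& amax,
                std::vector<int>& ttra, int t, bool parallel) {
    assert(c.size() == amax.size() && c.size() == ttra.size());
    const int n = static_cast<int>(c.size());
    #pragma omp parallel for if(parallel)
    for (int k = 0; k < n; ++k) {
        // Declared inside the loop body, so every thread has its own copy;
        // no private() clause is needed for these two.
        const float mag = std::fabs(c[k]);
        const bool is_new_max = mag > amax[k];
        if (is_new_max) {
            amax[k] = mag;
            ttra[k] = t;
        }
    }
}
```

Running the function once with `parallel` set to `false` and once with `true` on copies of the same input, then comparing the outputs element by element, is a simple way to check that parallelization did not change the results.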

One more thing: before attempting any parallelization, make sure the serial code is already optimized.

Depending on your compiler, the compiler flags, and the complexity of the expression, the compiler may or may not perform the following optimizations for you, so it can help to apply them by hand:

// avoid recomputing nnoib-2 on every iteration
int t_nnoib = nnoib - 2;
for (i=2; i< t_nnoib; ++i){
    // avoid recomputing nnojb-2 on every iteration
    int t_nnojb = nnojb - 2;
    // avoid reloading absi[i] on every iteration; auto keeps the element type
    // (declaring it as int would truncate a floating-point value)
    const auto t_absi = absi[i];
    for (j=2; j< t_nnojb; ++j) {
        C[i][j]= t_absi * absj[j] *
             (2.0f*B[i][j] + t_absi * absj[j] *
             (VEL[i][j] * VEL[i][j] * fat *
             (16.0f * (B[i][j-1] + B[i][j+1] + B[i-1][j] + B[i+1][j])
              -1.0f * (B[i][j-2] + B[i][j+2] + B[i-2][j] + B[i+2][j]) 
              -60.0f * B[i][j]
             ) - A[i][j]));

        // c2 is unnecessary: test the condition directly
        if (abs(C[i][j]) > Amax[i][j]) {
            Amax[i][j] = abs(C[i][j]);
            Ttra[i][j] = t;
        }
    }
}

It may not seem like much, but it can have a large impact on your code. The compiler will try to keep local variables in registers, which are much faster to access than memory. Keep in mind that you can't apply this technique indefinitely: the number of registers is limited, and overusing it will cause register spilling.

In the case of the array absi, copying absi[i] into a local variable means the value does not have to be fetched from the array (and that part of the array kept in cache) on every iteration of the j loop. The general idea of this technique is to move any array access that does not depend on the inner loop's variable to the outer loop.
