Terrible performance - a simple issue of overhead, or is there a program flaw?

Problem description

I have here what I understand to be a relatively simple OpenMP construct. The issue is that the program runs about 100-300x faster with 1 thread than with 2 threads. 87% of the run time is spent in gomp_send_wait() and another 9.5% in gomp_send_post().

The program gives correct results, but I wonder if there is a flaw in the code that is causing some resource conflict, or if it is simply that the overhead of thread creation is drastically not worth it for a loop with a chunk size of 4. p ranges from 17 to 1000, depending on the size of the molecule we're simulating.

My numbers are for the worst case, when p is 17 and the chunk size is 4. The performance is the same whether I use static, dynamic, or guided scheduling. With p=150 and a chunk size of 75, the program is still 75x-100x slower than serial.

...
    double e_t_sum=0.0;
    double e_in_sum=0.0;

    int nthreads,tid;

    #pragma omp parallel for schedule(static, 4) reduction(+ : e_t_sum, e_in_sum) shared(ee_t) private(tid, i, d_x, d_y, d_z, rr) firstprivate(V_in, t_x, t_y, t_z) lastprivate(nthreads)
    for (i = 0; i < p; i++){
        if (i != c){
            nthreads = omp_get_num_threads();               
            tid = omp_get_thread_num();

            d_x = V_in[i].x - t_x; 
            d_y = V_in[i].y - t_y;
            d_z = V_in[i].z - t_z;


            rr = d_x * d_x + d_y * d_y + d_z * d_z;

            if (i < c){

                ee_t[i][c] = energy(rr, V_in[i].q, V_in[c].q, V_in[i].s, V_in[c].s);
                e_t_sum += ee_t[i][c]; 
                e_in_sum += ee_in[i][c];    
            }
            else{

                ee_t[c][i] = energy(rr, V_in[i].q, V_in[c].q, V_in[i].s, V_in[c].s);
                e_t_sum += ee_t[c][i]; 
                e_in_sum += ee_in[c][i];    
            }

            // if(pid==0){printf("e_t_sum[%d]: %f\n", tid, e_t_sum[tid]);}

        }
    }//end parallel for 


    e_t += e_t_sum;
    e_t -= e_in_sum;

...

Recommended answer

First, I don't think optimizing your serial code in this case will help answer your OpenMP dilemma. Don't worry about it.

IMO there are three possible explanations for the slowdown:

  1. This one can explain a slowdown easily: the elements of the array ee_t are leading to false sharing within the cache line. False sharing is when cores end up writing to the same cache line, not because they are actually sharing data, but because what the cores are writing happens to fall in the same cache line (which is why it is called false sharing). I can explain more if you don't find false sharing on Google. Making the ee_t elements cache-line aligned may help a lot; see the padding sketch after this list.

  2. The overhead of spawning work is higher than the parallelism benefit. Have you tried fewer than 8 cores? How is performance at 2 cores? (A small timing harness for this experiment appears after this list.)

  3. The total number of iterations is small; take 17 as an example. If you split it across 8 cores, the loop will suffer load-imbalance problems, especially since some of your iterations do practically no work (when i == c). At least one core will have to do 3 iterations while all the others do 2. This does not explain a slowdown, but it is surely one reason why the speedup is not as high as you might expect. Since your iterations are of varying lengths, I would use a dynamic schedule with a chunk size of 1, or use OpenMP's guided schedule. Experiment with the chunk size; a chunk that is too small will also lead to a slowdown. A sketch of the changed loop header appears after this list.
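
On point 1, here is a minimal sketch of the padding idea, assuming ee_t is a heap-allocated p x p matrix of doubles (the question doesn't show its allocation). The padded_double type, the alloc_ee() helper, and the 64-byte CACHE_LINE value are hypothetical names for illustration:

    #include <stdlib.h>

    #define CACHE_LINE 64   /* assumed cache-line size; check your CPU */

    /* Wrap each double in a struct padded out to a full cache line, so
       writes to ee_t[c][i] and ee_t[c][i+1] by different threads can
       never touch the same line. Access elements as ee[i][j].v. */
    typedef struct {
        double v;
        char pad[CACHE_LINE - sizeof(double)];
    } padded_double;

    /* Hypothetical replacement for the ee_t allocation: a p x p matrix
       of cache-line-aligned, padded elements (error checks omitted). */
    padded_double **alloc_ee(int p) {
        padded_double **ee = malloc(p * sizeof *ee);
        for (int i = 0; i < p; i++)
            ee[i] = aligned_alloc(CACHE_LINE, p * sizeof **ee);
        return ee;
    }

The cost is one full cache line per element, so this trades memory for write isolation; whether that pays off is worth measuring.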
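
On point 2, a self-contained sketch of the thread-count experiment; busy_sum() is an invented stand-in workload so the harness compiles on its own (build with -fopenmp):

    #include <omp.h>
    #include <stdio.h>

    /* Stand-in workload: any parallel reduction will do for measuring
       how wall time scales with the thread count. */
    static double busy_sum(int p) {
        double sum = 0.0;
        #pragma omp parallel for reduction(+ : sum)
        for (int i = 0; i < p; i++)
            sum += (double)i * i;
        return sum;
    }

    int main(void) {
        for (int n = 1; n <= 8; n *= 2) {   /* try 1, 2, 4, 8 threads */
            omp_set_num_threads(n);
            double t0 = omp_get_wtime();
            double s = busy_sum(17);        /* p = 17, the worst case */
            printf("%d thread(s): %.6f s (sum=%g)\n",
                   n, omp_get_wtime() - t0, s);
        }
        return 0;
    }

Setting the OMP_NUM_THREADS environment variable gives the same control without recompiling.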
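
On point 3, a sketch of the suggested change: only the schedule clause differs from the pragma in the question, and the loop body is assumed to stay exactly as posted. With schedule(runtime), the standard OMP_SCHEDULE environment variable (e.g. OMP_SCHEDULE="guided") picks the schedule at launch, which makes the chunk-size experiments cheap:

    /* Dynamic schedule with chunk size 1, as suggested in point 3; all
       other clauses are unchanged from the question. Swapping in
       schedule(guided) or schedule(runtime) touches only this line. */
    #pragma omp parallel for schedule(dynamic, 1) \
        reduction(+ : e_t_sum, e_in_sum) shared(ee_t) \
        private(tid, i, d_x, d_y, d_z, rr) \
        firstprivate(V_in, t_x, t_y, t_z) lastprivate(nthreads)
    for (i = 0; i < p; i++) {
        /* ... loop body exactly as posted ... */
    }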

Let me know how it goes.
