Reduce OpenMP fork/join overhead by separating #omp parallel and #omp for


Question


I'm reading the book An Introduction to Parallel Programming by Peter S. Pacheco. In Section 5.6.2, it gives an interesting discussion about reducing the fork/join overhead. Consider the odd-even transposition sort algorithm:

for(phase=0; phase < n; phase++){
    if(phase % 2 == 0){  /* even phase */
#       pragma omp parallel for default(none) shared(n) private(i)
        for(i=1; i<n; i+=2){ /* meat */ }
    }
    else{                /* odd phase */
#       pragma omp parallel for default(none) shared(n) private(i)
        for(i=1; i<n-1; i+=2){ /* meat */ }
    }
}


The author argues that the above code has somewhat high fork/join overhead, because the threads are forked and joined in every iteration of the outer loop. Hence, he proposes the following version:

# pragma omp parallel default(none) shared(n) private(i, phase)
for(phase=0; phase < n; phase++){
    if(phase % 2 == 0){  /* even phase */
#       pragma omp for
        for(i=1; i<n; i+=2){ /* meat */ }
    }
    else{                /* odd phase */
#       pragma omp for
        for(i=1; i<n-1; i+=2){ /* meat */ }
    }
}


According to the author, the second version forks the threads before the outer loop starts and reuses them for each iteration, yielding better performance.


However, I'm suspicious of the correctness of the second version. In my understanding, a #pragma omp parallel directive starts a group of threads and lets them execute the following structured block in parallel. In this case, the structured block should be the whole outer for-loop for(phase=0 ...). Then shouldn't the whole outer loop be executed four times when 4 threads are used? That is, if n=10, then 40 iterations would be executed on 4 threads. What is wrong with my understanding? And how does omp parallel (without for) interact with a following for-loop like the one above?

Answer

The second version is correct.


Per the OpenMP specification, a #pragma omp parallel for directive is just a shortcut for #pragma omp parallel immediately followed by #pragma omp for, as in

#pragma omp parallel
{
    #pragma omp for
    for(int i=0; i<N; ++i) { /*loop body*/ }
}


If there is some code in the parallel region before or after the loop construct, it is executed independently by each thread in the region (unless restricted by other OpenMP directives). But #pragma omp for is a work-sharing construct: the loop that follows the directive is shared by all threads in the region. That is, it is executed as a single loop whose iterations are split across the threads in some way. Thus, if the parallel region above is executed by 4 threads, the loop is still executed only once, not 4 times.


Back to the example in your question: the phase loop is executed by each thread individually, but the #pragma omp for at each phase iteration marks the start of a shared loop. For n=10, each thread will enter a shared loop 10 times and execute a part of it; so there are not 40 executions of the inner loops, but just 10.


Note there is an implicit barrier at the end of #pragma omp for; it means that a thread that has completed its portion of a shared loop will not proceed until all other threads have completed their portions too. So execution is synchronized across the threads. This is necessary for correctness in most cases; in your example, it guarantees that the threads always work on the same phase. However, if consecutive shared loops within a region are safe to execute simultaneously, a nowait clause can be used to eliminate the implicit barrier and allow threads to proceed immediately to the rest of the parallel region.


Note also that such processing of work sharing directives is quite specific to OpenMP. With other parallel programming frameworks, the logic you used in the question might be correct.


Lastly, smart OpenMP implementations do not join threads after a parallel region completes; instead, threads may busy-wait for some time and then sleep until another parallel region starts. This is done precisely to avoid high overhead at the start and end of parallel regions. So while the optimization suggested in the book may still remove some overhead, for some algorithms its impact on execution time is negligible. The algorithm in the question is quite likely one of them: in the first implementation, the parallel regions follow one another quickly within the serial loop, so the OpenMP worker threads will most likely still be active at the beginning of each region and start it quickly, avoiding the fork/join overhead. So don't be surprised if in practice you see no performance difference from the described optimization.

