Influence on the static scheduling overhead in OpenMP


Question

I thought about which factors would influence the static scheduling overhead in OpenMP. In my opinion it is influenced by:


  • the CPU performance

  • the OpenMP run-time library

  • the number of threads

But am I missing further factors? Maybe the size of the tasks, ...?

And furthermore: is the overhead linearly dependent on the number of iterations? In that case, with static scheduling and 4 cores, I would expect the overhead to increase linearly with 4*i iterations. Correct so far?

Edit:
I am only interested in the static (!) scheduling overhead itself. I am not talking about thread start-up overhead and time spent in synchronisation and critical section overhead.
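
As a minimal sketch of how one might measure this (not from the original post): the thread team is created once outside the timed region, so thread start-up is excluded and only the worksharing loop, and hence mostly its scheduling cost, is timed. Repeating this for several values of N and different schedule clauses would show how that cost scales with the iteration count.

#include <stdio.h>
#include <omp.h>

#define N 1000000L

int main(void) {
    long long sum = 0;
    double t0 = 0.0, t1 = 0.0;

    #pragma omp parallel reduction(+:sum)
    {
        /* the team already exists here, so thread start-up is not timed */
        #pragma omp single
        t0 = omp_get_wtime();    /* implicit barrier at the end of single */

        #pragma omp for schedule(static)
        for (long i = 0; i < N; i++)
            sum += i;            /* trivial body: mostly scheduling cost */
        /* implicit barrier at the end of the for loop */

        #pragma omp master
        t1 = omp_get_wtime();
    }

    printf("loop time = %g s (sum = %lld)\n", t1 - t0, sum);
    return 0;
}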

Answer

You need to separate the overhead for OpenMP to create a team/pool of threads from the overhead for each thread to operate on a separate set of iterations of a for loop.

Static scheduling is easy to implement by hand (which is sometimes very useful). Let's consider what I consider the two most important static scheduling cases, schedule(static) and schedule(static,1), and then compare them to schedule(dynamic,chunk).

#pragma omp parallel for schedule(static)
for(int i=0; i<N; i++) foo(i);

is equivalent to (but not necessarily equal to)

#pragma omp parallel
{
    int start = omp_get_thread_num()*N/omp_get_num_threads();
    int finish = (omp_get_thread_num()+1)*N/omp_get_num_threads();
    for(int i=start; i<finish; i++) foo(i);
}

#pragma omp parallel for schedule(static,1)
for(int i=0; i<N; i++) foo(i);

is equivalent to

#pragma omp parallel 
{
    int ithread = omp_get_thread_num();
    int nthreads = omp_get_num_threads();
    for(int i=ithread; i<N; i+=nthreads) foo(i);
}

From this you can see that it's quite trivial to implement static scheduling and so the overhead is negligible.

On the other hand if you want to implement schedule(dynamic) (which is the same as schedule(dynamic,1)) by hand it's more complicated:

int cnt = 0;
#pragma omp parallel
for(int i=0;;) {
    #pragma omp atomic capture
    i = cnt++;
    if(i>=N) break;
    foo(i);                                    
}

This requires OpenMP >= 3.1. If you wanted to do this with OpenMP 2.0 (for MSVC) you would need to use critical like this:

int cnt = 0;
#pragma omp parallel
for(int i=0;;) {
    #pragma omp critical   
    i = cnt++;
    if(i>=N) break;
    foo(i);
} 

Here is an equivalent to schedule(dynamic,chunk) (I have not optimized this using atomic accesses):

int cnt = 0;
int chunk = 5;
#pragma omp parallel
{
    int start, finish;
    do {
        #pragma omp critical
        {
            start = cnt;
            finish = cnt+chunk < N ? cnt+chunk : N;
            cnt += chunk;
        }
        for(int i=start; i<finish; i++) foo(i);
    } while(finish<N);
}

Clearly using atomic accesses is going to cause more overhead. This also shows why using larger chunks for schedule(dynamic,chunk) can reduce the overhead.
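
As a side note (a sketch, not from the original answer), the atomic optimization hinted at above could look like this with OpenMP >= 3.1, reusing N and foo from the snippets above: each thread reserves a whole chunk of iterations with a single atomic capture instead of a critical section.

int cnt = 0;
int chunk = 5;
#pragma omp parallel
{
    int start, finish;
    do {
        /* reserve the next chunk of iterations with one atomic update */
        #pragma omp atomic capture
        { start = cnt; cnt += chunk; }
        finish = start + chunk < N ? start + chunk : N;
        for(int i=start; i<finish; i++) foo(i);
    } while(finish<N);
}

Because the synchronisation now happens once per chunk rather than once per iteration, this also illustrates why schedule(dynamic,chunk) with a larger chunk has less overhead than schedule(dynamic,1).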
