OpenMP schedule(static) with no chunk size specified: chunk size and order of assignment


Question

I have a few questions regarding #pragma omp for schedule(static) where the chunk size is not specified.

One way to parallelize a loop in OpenMP is to do it manually like this:

#pragma omp parallel
{
    // Split the range [0, N) into one contiguous block per thread.
    const int nthreads = omp_get_num_threads();
    const int ithread = omp_get_thread_num();
    const int start = ithread*N/nthreads;
    const int finish = (ithread+1)*N/nthreads;
    for(int i = start; i<finish; i++) {
        // loop body
    }
}

Is there a good reason not to manually parallelize a loop like this in OpenMP? If I compare the values with #pragma omp for schedule(static), I see that the chunk sizes for a given thread don't always agree, so OpenMP (in GCC) is choosing chunk sizes differently from the ones defined by start and finish. Why is this?
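To make the disagreement concrete, here is a minimal sketch that compares the manual split with one other split that is just as "approximately equal". The second scheme (the first N % nthreads threads each take one extra iteration) and the names range_t, manual_split, remainder_first_split and print_splits are my own assumptions for illustration; nothing in this thread confirms that any particular runtime uses that scheme.

```c
#include <assert.h>
#include <stdio.h>

/* Half-open iteration range [start, finish) owned by one thread. */
typedef struct { int start; int finish; } range_t;

/* The manual split from the question: boundaries at ithread*N/nthreads. */
range_t manual_split(int N, int nthreads, int ithread) {
    assert(N >= 0 && nthreads > 0 && ithread >= 0 && ithread < nthreads);
    range_t r = { ithread * N / nthreads, (ithread + 1) * N / nthreads };
    return r;
}

/* Hypothetical alternative: the first N % nthreads threads get one extra
   iteration. An assumption for illustration, not a documented guarantee. */
range_t remainder_first_split(int N, int nthreads, int ithread) {
    assert(N >= 0 && nthreads > 0 && ithread >= 0 && ithread < nthreads);
    int q = N / nthreads;
    int rem = N % nthreads;
    int start = ithread * q + (ithread < rem ? ithread : rem);
    range_t r = { start, start + q + (ithread < rem ? 1 : 0) };
    return r;
}

/* Print both partitions side by side so the chunk sizes can be compared. */
void print_splits(int N, int nthreads) {
    for (int t = 0; t < nthreads; t++) {
        range_t a = manual_split(N, nthreads, t);
        range_t b = remainder_first_split(N, nthreads, t);
        printf("thread %d: manual [%d,%d)  alternative [%d,%d)\n",
               t, a.start, a.finish, b.start, b.finish);
    }
}
```

For N = 10 and 4 threads, the manual split gives chunk sizes 2, 3, 2, 3 while the alternative gives 3, 3, 2, 2. Both give each thread at most one chunk of roughly equal size, so a per-thread comparison can disagree without either split being wrong.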

The start and finish values I defined have several convenient properties.

  1. Each thread gets at most one chunk.
  2. The range of iteration values increases directly with thread number (i.e. for 100 iterations with two threads, the first thread will process iterations 1-50 and the second thread 51-100, and not the other way around).
  3. For two for loops over exactly the same range, each thread will run over exactly the same iterations.

Originally I said exactly one chunk, but after thinking about it, the size of a chunk can be zero if the number of threads is much larger than N (that is, when ithread*N/nthreads = (ithread+1)*N/nthreads). The property I really want is at most one chunk.
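A quick way to see the zero-size case is to count how many threads end up with start equal to finish. This is only a sketch of the arithmetic above; count_empty_ranges is a hypothetical helper name, not part of OpenMP.

```c
#include <assert.h>

/* Number of threads whose range [start, finish) is empty under the
   manual split start = ithread*N/nthreads. */
int count_empty_ranges(int N, int nthreads) {
    assert(N >= 0 && nthreads > 0);
    int empty = 0;
    for (int ithread = 0; ithread < nthreads; ithread++) {
        int start = ithread * N / nthreads;
        int finish = (ithread + 1) * N / nthreads;
        if (start == finish) {
            empty++;  /* this thread has no iterations to run */
        }
    }
    return empty;
}
```

With N = 3 and 8 threads, five of the eight threads get an empty range, which is why "at most one chunk" is the accurate wording.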

Are all these properties guaranteed when using #pragma omp for schedule(static)?

According to the OpenMP specifications:

Programs that depend on which thread executes a particular iteration under any other circumstances are non-conforming.

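Different loop regions with the same schedule and iteration count, even if they occur in the same parallel region, can distribute iterations among threads differently. The only exception is for the static schedule.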

For schedule(static) the specification says:

chunks are assigned to the threads in the team in a round-robin fashion in the order of the thread number.

Additionally, the specification says for schedule(static):

When no chunk_size is specified, the iteration space is divided into chunks that are approximately equal in size, and at most one chunk is distributed to each thread.

Finally, the specification says for schedule(static):

A compliant implementation of the static schedule must ensure that the same assignment of logical iteration numbers to threads will be used in two loop regions if the following conditions are satisfied: 1) both loop regions have the same number of loop iterations, 2) both loop regions have the same value of chunk_size specified, or both loop regions have no chunk_size specified, 3) both loop regions bind to the same parallel region.
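One way to check this guarantee empirically is to record which thread runs each iteration in two schedule(static) loops inside the same parallel region and compare the records. The sketch below assumes a compiler with OpenMP support; same_assignment is a hypothetical helper, and the serial stub for omp_get_thread_num is only there so the file still builds without OpenMP.

```c
#include <assert.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stub (assumption: building without OpenMP support). */
static int omp_get_thread_num(void) { return 0; }
#endif

/* Returns 1 if both loops gave every logical iteration to the same thread,
   0 otherwise (or if memory allocation fails). */
int same_assignment(int n) {
    assert(n > 0);
    int *first = malloc(n * sizeof *first);
    int *second = malloc(n * sizeof *second);
    if (!first || !second) {
        free(first);
        free(second);
        return 0;
    }

    #pragma omp parallel
    {
        /* Same iteration count, no chunk_size, same parallel region:
           the three conditions quoted above. */
        #pragma omp for schedule(static)
        for (int i = 0; i < n; i++) first[i] = omp_get_thread_num();

        #pragma omp for schedule(static)
        for (int i = 0; i < n; i++) second[i] = omp_get_thread_num();
    }

    int same = 1;
    for (int i = 0; i < n; i++) {
        if (first[i] != second[i]) same = 0;
    }
    free(first);
    free(second);
    return same;
}
```

A run that returns 1 is consistent with the quoted text but does not prove it; the guarantee comes from the specification, not from the experiment.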

So if I read this correctly, schedule(static) will have the same convenient properties I listed for start and finish, even though my code relies on which thread executes a particular iteration. Do I interpret this correctly? This seems to be a special case for schedule(static) when the chunk size is not specified.

It's easier to just define start and finish like I did than to try to interpret the specification for this case.

Recommended answer

Is there a good reason not to manually parallelize a loop like this in OpenMP?

The first reasons that come to mind:

  1. You added a dependency on the OpenMP library, so you either have to duplicate part of your code if you want to maintain a serial compilation, or you have to provide stubs for the library function calls.
  2. A worksharing construct requires less code and is easier to read than an explicitly parallelized loop. If you need to maintain a large code base, that matters.
  3. You miss every scheduling option apart from schedule(static).
  4. Messing with low-level details can easily backfire.
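As a rough illustration of points 1 and 2, here is a worksharing version of a simple loop. It calls no OpenMP library functions, so a compiler without OpenMP enabled just ignores the pragma and builds the serial loop. The function sum_squares is a made-up example, not code from the question.

```c
#include <assert.h>

/* Sum of i*i for i in [0, n). Builds and runs serially if the pragma
   is ignored, because no omp_* library calls are involved. */
long long sum_squares(int n) {
    assert(n >= 0);
    long long sum = 0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += (long long)i * i;
    }
    return sum;
}
```

Changing schedule(static) to another schedule kind is a one-word edit here, whereas the manual start/finish version would have to be rewritten (point 3).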

Are all these properties guaranteed when using #pragma omp for schedule(static)?

Let's go through them one by one:

1.) Each thread gets exactly one chunk

When no chunk_size is specified, the iteration space is divided into chunks that are approximately equal in size, and at most one chunk is distributed to each thread. Note that the size of the chunks is unspecified in this case.

At most one chunk is not exactly one chunk, so property one is not fulfilled. Besides, this is why the size of the chunks is unspecified.

2.) The range of values for iterations increase directly with thread number (i.e. for 100 iterations with two threads the first thread will process iterations 1-50 and the second thread 51-100 and not the other way around)

A compliant implementation of the static schedule must ensure that the same assignment of logical iteration numbers to threads will be used in two loop regions if the following conditions are satisfied:

  1. both loop regions have the same number of loop iterations
  2. both loop regions have the same value of chunk_size specified, or both loop regions have no chunk_size specified
  3. both loop regions bind to the same parallel region.

A data dependence between the same logical iterations in two such loops is guaranteed to be satisfied allowing safe use of the nowait clause (see Section A.10 on page 182 for examples).

Even though I have never seen anything different from what you describe, I dare say that even property two is not guaranteed, at least not for schedule(static) without a chunk size. It seems to me that, for an iteration space of a given cardinality, the only guarantee is that the same "logical iteration numbers" will be given to the same thread if conditions 1, 2 and 3 are respected.

It is indeed guaranteed if you specify the chunk size:

When schedule(static, chunk_size) is specified, iterations are divided into chunks of size chunk_size, and the chunks are assigned to the threads in the team in a round-robin fashion in the order of the thread number.
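Read literally, that round-robin rule pins down the owner of every iteration. A minimal sketch of the mapping, assuming zero-based logical iteration numbers and thread numbers starting at 0 (static_chunk_owner is a hypothetical helper, not an OpenMP function):

```c
#include <assert.h>

/* Thread that owns logical iteration i under schedule(static, chunk_size):
   chunk number i / chunk_size, dealt out round-robin over nthreads. */
int static_chunk_owner(int i, int chunk_size, int nthreads) {
    assert(i >= 0 && chunk_size > 0 && nthreads > 0);
    return (i / chunk_size) % nthreads;
}
```

With chunk_size = 2 and 3 threads, iterations 0-1 go to thread 0, 2-3 to thread 1, 4-5 to thread 2, and 6-7 wrap around to thread 0 again.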

3.) For two for loops over exactly the same range each thread will run over exactly the same iterations

This is indeed guaranteed, and the guarantee is even more general: for two loops whose iteration spaces have the same cardinality, the same "logical iteration numbers" will be given to each thread. Example A.10.2c of the OpenMP 3.1 standard should clarify this point.
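In the spirit of that example (this is my own sketch, not a reproduction of it), two schedule(static) loops with the same iteration count in one parallel region can both carry nowait, because the thread that writes tmp[i] in the first loop is the same thread that reads it in the second. The function name scale_then_shift and the arithmetic are assumptions for illustration.

```c
#include <assert.h>

/* out[i] = 2*in[i] + 1, computed in two passes through a temporary. */
void scale_then_shift(int n, const float *in, float *tmp, float *out) {
    assert(n >= 0 && in && tmp && out);
    #pragma omp parallel
    {
        /* No barrier after this loop: safe only because the next loop has
           the same iteration count, the same (absent) chunk_size, and
           binds to the same parallel region. */
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < n; i++) tmp[i] = 2.0f * in[i];

        #pragma omp for schedule(static) nowait
        for (int i = 0; i < n; i++) out[i] = tmp[i] + 1.0f;
    }
}
```

If the second loop used a different schedule or iteration count, the first nowait would have to go, since another thread might read tmp[i] before it is written.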
