OpenMP Dynamic vs Guided Scheduling


Question

I'm studying OpenMP's scheduling and specifically the different types. I understand the general behavior of each type, but clarification would be helpful regarding when to choose between dynamic and guided scheduling.

Intel's docs describe dynamic scheduling:

Use the internal work queue to give a chunk-sized block of loop iterations to each thread. When a thread is finished, it retrieves the next block of loop iterations from the top of the work queue. By default, the chunk size is 1. Be careful when using this scheduling type because of the extra overhead involved.

It also describes guided scheduling:

Similar to dynamic scheduling, but the chunk size starts off large and decreases to better handle load imbalance between iterations. The optional chunk parameter specifies the minimum chunk size to use. By default, the chunk size is approximately loop_count/number_of_threads.

Since guided scheduling dynamically decreases the chunk size at runtime, why would I ever use dynamic scheduling?

I've researched this question and found this table from Dartmouth:

guided is listed as having high overhead, while dynamic has medium overhead.

This initially made sense, but upon further investigation I read an Intel article on the topic. From the previous table, I theorized guided scheduling would take longer because of the analysis and adjustments of the chunk size at runtime (even when used correctly). However, in the Intel article it states:

Guided schedules work best with small chunk sizes as their limit; this gives the most flexibility. It’s not clear why they get worse at bigger chunk sizes, but they can take too long when limited to large chunk sizes.

Why would the chunk size relate to guided taking longer than dynamic? It would make sense for the lack of "flexibility" to cause performance loss through locking the chunk size too high. However, I would not describe this as "overhead", and the locking problem would discredit the previous theory.

Lastly, it's stated in the article:

Dynamic schedules give the most flexibility, but take the biggest performance hit when scheduled wrong.

It makes sense for dynamic scheduling to be more optimal than static, but why is it more optimal than guided? Is it just the overhead I'm questioning?

This somewhat related SO post explains NUMA related to the scheduling types. It's irrelevant to this question, since the required organization is lost by the "first come, first served" behavior of these scheduling types.

dynamic scheduling may be coalescent, causing a performance improvement, but then the same hypothesis should apply to guided.

Here's the timing of each scheduling type across different chunk sizes from the Intel article for reference. It's only recordings from one program, and some rules apply differently per program and machine (especially with scheduling), but it should provide the general trends.

[Chart from the Intel article: runtime vs. chunk size for each scheduling type — not reproduced here]

EDIT (core of my question):

  • What affects the runtime of guided scheduling? Specific examples? Why is it slower than dynamic in some cases?
  • When would I favor guided over dynamic or vice-versa?
  • Once this has been explained, do the sources above support your explanation? Do they contradict at all?

Answer

What affects the runtime of guided scheduling?

There are three effects to consider here:

The whole point of dynamic/guided scheduling is to improve the distribution of work in the case where not each loop iteration contains the same amount of work. Fundamentally:

  • schedule(dynamic, 1) provides optimal load balancing
  • dynamic, k will always have the same or better load balancing than guided, k

The standard mandates that the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads in the team, decreasing to k.

The GCC OpenMP implementation takes this quite literally. For example, for 4 threads and k=1, it will split 32 iterations as 8, 6, 5, 4, 3, 2, 1, 1, 1, 1. Now, IMHO this is really stupid: it results in bad load imbalance if the first 1/n iterations contain more than 1/n of the work.

Specific examples? Why is it slower than dynamic in some cases?

OK, let's take a look at a trivial example where the inner work decreases with the loop iteration:

#include <omp.h>

// Busy work proportional to ww; volatile keeps the compiler from
// optimizing the loop away.
void work(long ww) {
    volatile long sum = 0;
    for (long w = 0; w < ww; w++) sum += w;
}

int main() {
    const long max = 32, factor = 10000000l;
    // The amount of work decreases linearly with the iteration index.
    #pragma omp parallel for schedule(guided, 1)
    for (int i = 0; i < max; i++) {
        work((max - i) * factor);
    }
}

The execution looks like this¹:

[Timeline figure: per-thread execution of the guided schedule — not reproduced here]

As you can see, guided does really badly here. guided will do much better for different kinds of work distributions. It is also possible to implement guided differently. The implementation in clang (which IIRC stems from Intel) is much more sophisticated. I truly do not understand the idea behind GCC's naive implementation. In my eyes it effectively defeats the purpose of dynamic load balancing if you give 1/n of the work to the first thread.

Now each dynamic chunk has some performance impact due to accessing shared state. The overhead of guided will be slightly higher per chunk than dynamic, as there is a bit more computation to do. However, guided, k will have fewer total dynamic chunks than dynamic, k.

The overhead will also depend on the implementation, e.g. whether it uses atomics or locks for protecting the shared state.

Let's say you write to a vector of integers in your loop iterations. If every second iteration were executed by a different thread, every second element of the vector would be written by a different core. That's really bad, because the threads then compete for cache lines that contain neighbouring elements (false sharing). If you have small chunk sizes and/or chunk sizes that don't align nicely to cache lines, you get bad performance at the "edges" of chunks. This is why you usually prefer large, nice (2^n) chunk sizes. guided can give you larger chunk sizes on average, but not 2^n (or k*m).

This answer (that you already referenced), discusses at length the disadvantage of dynamic/guided scheduling in terms of NUMA, but this also applies to locality/caches.

Given the various factors and the difficulty of predicting specifics, I can only recommend measuring your specific application, on your specific system, in your specific configuration, with your specific compiler. Unfortunately, there is no perfect performance portability. I would personally argue that this is particularly true for guided.

When would I favor guided over dynamic or vice-versa?

If you have specific knowledge about the overhead / work-per-iteration, I would say that dynamic, k gives you the most stable results by choosing a good k. In particular, you do not depend so much on how clever the implementation is.

On the other hand, guided may be a good first guess, with a reasonable overhead / load balancing ratio, at least for a clever implementation. Be particularly careful with guided if you know that later iteration times are shorter.

Keep in mind, there is also schedule(auto), which gives complete control to the compiler/runtime, and schedule(runtime), which allows you to select the scheduling policy during runtime.

Once this has been explained, do the sources above support your explanation? Do they contradict at all?

Take the sources, including this answer, with a grain of salt. Neither the chart you posted nor my timeline picture are scientifically accurate numbers. There is huge variance in the results, and there are no error bars; they would probably be all over the place with these very few data points. The chart also combines the multiple effects I mentioned without disclosing the Work code.

[From Intel docs]

By default the chunk size is approximately loop_count/number_of_threads.

This contradicts my observation that icc handles my small example much better.

1: Measured with GCC 6.3.1, visualized with Score-P/Vampir; two alternating work functions are used for coloring.
