OpenMP: What is the benefit of nesting parallelizations?


Question

From what I understand, #pragma omp parallel and its variations basically execute the following block in a number of concurrent threads, which corresponds to the number of CPUs. When having nested parallelizations - parallel for within parallel for, parallel function within parallel function etc. - what happens on the inner parallelization?

I'm new to OpenMP, and the case I have in mind is probably rather trivial - multiplying a vector with a matrix. This is done in two nested for loops. Assuming the number of CPUs is smaller than the number of elements in the vector, is there any benefit in trying to run the inner loop in parallel? Will the total number of threads be larger than the number of CPUs, or will the inner loop be executed sequentially?

Answer

(1) Nested parallelism in OpenMP: http://docs.oracle.com/cd/E19205-01/819-5270/aewbc/index.html

You need to turn on nested parallelism by setting the OMP_NESTED environment variable or calling omp_set_nested, because many implementations turn this feature off by default, and some do not fully support nested parallelism. If it is turned on, every thread that encounters a parallel for creates a new team with the number of threads defined by OMP_NUM_THREADS. So with two levels of parallelism, the total number of threads can reach N^2, where N = OMP_NUM_THREADS.

Such nested parallelism will cause oversubscription (i.e., the number of busy threads is greater than the number of cores), which may degrade the speedup. In an extreme case, where nested parallelism is invoked recursively, the thread count can balloon (e.g., thousands of threads), and the computer just wastes time on context switching. In such a case, you may let the runtime adjust the number of threads dynamically by calling omp_set_dynamic.
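The thread multiplication described above can be observed with a small sketch. The helper name count_nested_threads is hypothetical, and the sketch assumes a compiler with OpenMP enabled (e.g. -fopenmp). It uses omp_set_max_active_levels, which OpenMP 5.0 introduces as the replacement for the deprecated omp_set_nested, and it turns dynamic adjustment off only so that the count is easier to see. The runtime may still hand out fewer threads than requested.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: opens an outer team of `outer` threads, each of which
   opens an inner team of `inner` threads, and returns how many inner-level
   threads actually ran. Note that it changes global OpenMP settings. */
int count_nested_threads(int outer, int inner)
{
    int total = 0;
#ifdef _OPENMP
    /* Allow two active levels of parallelism. Older runtimes may need
       omp_set_nested(1) instead, as described in the answer. */
    omp_set_max_active_levels(2);
    /* Ask the runtime not to shrink the teams on its own. */
    omp_set_dynamic(0);
#endif
    #pragma omp parallel num_threads(outer)
    {
        #pragma omp parallel num_threads(inner)
        {
            /* Each inner-level thread counts itself once. */
            #pragma omp atomic
            total += 1;
        }
    }
    return total;
}
```

With nesting active and enough resources, count_nested_threads(4, 4) would typically report 16 threads on a machine that may have far fewer cores. Built without OpenMP, the pragmas are ignored and the function returns 1.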

(2) An example of matrix-vector multiplication: the code would look like:

// Input:  A (N by M), B (M by 1)
// Output: C (N by 1), assumed to be zero-initialized
for (int i = 0; i < N; ++i)
  for (int j = 0; j < M; ++j)
    C[i] += A[i][j] * B[j];

In general, parallelizing an inner loop when the outer loop can be parallelized is a bad idea because of the fork/join overhead of threads. (Although many OpenMP implementations pre-create threads, it still takes some work to dispatch tasks to the threads and to execute the implicit barrier at the end of each parallel for.)
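As a minimal sketch of the outer-loop approach, the function below pays the fork/join cost once for the whole multiplication. The name matvec_outer and the flat row-major layout of A are assumptions made for illustration; they are not part of the original code.

```c
/* Hypothetical helper: C = A * B, where A is an n-by-m matrix stored
   row-major in a flat array, B has m elements and C has n elements. */
void matvec_outer(int n, int m, const double *A, const double *B, double *C)
{
    /* One parallel region: the rows are divided among the threads. */
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;   /* private to the thread handling row i */
        for (int j = 0; j < m; ++j)
            sum += A[i * m + j] * B[j];
        C[i] = sum;         /* each row is written by exactly one thread */
    }
}
```

Accumulating into a local sum instead of C[i] directly also means C does not have to be zeroed beforehand.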

Your concern is the case where N < # of CPUs. Yes, right: in this case the speedup of the outer loop alone would be limited by N, and enabling nested parallelism will definitely have benefits.

However, the same code would then cause oversubscription if N is sufficiently large. I can think of the following solutions:


  • Changing the loop structure so that only a single-level loop exists. (It looks doable.)
  • Specializing the code: if N is small, use nested parallelism; otherwise don't.
  • Nested parallelism with omp_set_dynamic. But please make sure you understand how omp_set_dynamic controls the number of threads and the activity of threads. Implementations may vary.
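The second option, specializing on N, might look roughly like the sketch below. It swaps in a simpler variant than true nesting: when there are fewer rows than threads it parallelizes only the inner loop with a reduction, and otherwise only the outer loop, so the thread count never exceeds one team. The name matvec_adaptive, the flat row-major layout of A, and the threshold (rows versus omp_get_max_threads) are all assumptions for illustration.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: C = A * B for a row-major n-by-m matrix A,
   choosing which loop to parallelize from the number of rows. */
void matvec_adaptive(int n, int m, const double *A, const double *B, double *C)
{
    int threads = 1;
#ifdef _OPENMP
    threads = omp_get_max_threads();
#endif
    if (n >= threads) {
        /* Enough rows to keep every thread busy: one parallel region. */
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            double sum = 0.0;
            for (int j = 0; j < m; ++j)
                sum += A[i * m + j] * B[j];
            C[i] = sum;
        }
    } else {
        /* Few rows: split each dot product across the threads instead.
           This pays one fork/join per row, which is acceptable when n is small. */
        for (int i = 0; i < n; ++i) {
            double sum = 0.0;
            #pragma omp parallel for reduction(+:sum)
            for (int j = 0; j < m; ++j)
                sum += A[i * m + j] * B[j];
            C[i] = sum;
        }
    }
}
```

Whether the inner-loop branch pays off depends on M being large enough to amortize the per-row overhead, so the threshold is worth measuring on the target machine rather than taking from this sketch.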
