Nested Parallelism: Why does only the main thread run and execute the parallel for loop four times?

Views: 46 Published: 2021/6/4 20:01:28 Tags: c multithreading parallel-processing openmp nested-parallelism


Problem description

My code:

#include <stdio.h>
#include <omp.h>

int main() {
    omp_set_num_threads(4);  
    #pragma omp parallel
    {
        #pragma omp parallel for 
        for (int i = 0; i < 6; i++)
        {
            printf("i = %d, I am Thread %d\n", i, omp_get_thread_num());
        }    
    }
    return 0;
}

The output that I am getting:

i = 0, I am Thread 0
i = 1, I am Thread 0
i = 2, I am Thread 0
i = 0, I am Thread 0
i = 1, I am Thread 0
i = 0, I am Thread 0
i = 1, I am Thread 0
i = 2, I am Thread 0
i = 2, I am Thread 0
i = 3, I am Thread 0
i = 0, I am Thread 0
i = 1, I am Thread 0
i = 3, I am Thread 0
i = 4, I am Thread 0
i = 5, I am Thread 0
i = 2, I am Thread 0
i = 3, I am Thread 0
i = 4, I am Thread 0
i = 5, I am Thread 0
i = 3, I am Thread 0
i = 4, I am Thread 0
i = 5, I am Thread 0
i = 4, I am Thread 0
i = 5, I am Thread 0

Adding "parallel" is the cause of the problem, but I don't know how to explain it.

My question is: why does only the main thread appear, and why does it run the for loop four times?

Solution

By default, nested parallelism is disabled. Nonetheless, you can explicitly enable it, either by calling:

   omp_set_nested(1);

or by setting the OMP_NESTED environment variable to true.

Also, from the OpenMP standard we know that:

When a thread encounters a parallel construct, a team of threads is created to execute the parallel region. The thread that encountered the parallel construct becomes the master thread of the new team, with a thread number of zero for the duration of the new parallel region. All threads in the new team, including the master thread, execute the region. Once the team is created, the number of threads in the team remains constant for the duration of that parallel region.

From the source you can also read the following:

OpenMP parallel regions can be nested inside each other. If nested parallelism is disabled, then the new team created by a thread encountering a parallel construct inside a parallel region consists only of the encountering thread. If nested parallelism is enabled, then the new team may consist of more than one thread.

This explains why, when you add the second parallel region, there is only one thread per team executing the enclosed code (i.e., the for loop). In other words, the first parallel region creates a team of 4 threads; each of those threads, upon encountering the second parallel region, creates a new team and becomes the master of that team (i.e., it has ID=0 within the newly created team). However, because you did not explicitly enable nested parallelism, each of those teams is composed of a single thread. Hence, 4 teams with one thread each execute the for loop. Consequently, you will have the following statement:

printf("i = %d, I am Thread %d\n", i, omp_get_thread_num());

being printed 6 x 4 = 24 times (i.e., the total number of loop iterations multiplied by the total number of threads across the 4 teams). The image below provides a visualization of that flow:

[Image: execution flow with nested parallelism disabled]

If you add a printf statement between the first and the second parallel region, as follows:

#include <stdio.h>
#include <omp.h>

int main() {

    omp_set_num_threads(4);
    #pragma omp parallel
    {
       printf("Before nested parallel region:  I am Thread{%d}\n", omp_get_thread_num());
       #pragma omp parallel for // Adding "parallel" is the cause of the problem, but I don't know how to explain it.
       for (int i = 0; i < 6; i++)
       {
           printf("i = %d, I am Thread %d\n", i, omp_get_thread_num());
       }
    }
    return 0;
}

You would get output similar to the following (bear in mind that the order in which the first 4 lines are printed is nondeterministic):

Before nested parallel region:  I am Thread{1}
Before nested parallel region:  I am Thread{0}
Before nested parallel region:  I am Thread{2}
Before nested parallel region:  I am Thread{3}
i = 0, I am Thread 0
i = 0, I am Thread 0
i = 0, I am Thread 0
(...)
i = 5, I am Thread 0

This means that within the first parallel region (but still outside the second parallel region) there is a single team of 4 threads, with IDs ranging from 0 to 3, executing in parallel. Hence, each of those threads will execute the printf statement:

printf("Before nested parallel region:  I am Thread{%d}\n", omp_get_thread_num());

and display a different value for the omp_get_thread_num() call.

As previously mentioned, nested parallelism is disabled. Thus, when each of those threads encounters the second parallel region, it creates a new team and becomes the master, and the only member, of that team (i.e., it has ID=0 within the newly created team). This is why the statement

 printf("i = %d, I am Thread %d\n", i, omp_get_thread_num());

inside the loop always outputs (..) I am Thread 0, since omp_get_thread_num() in this context always returns 0. However, even though omp_get_thread_num() returns 0, this does not imply that the code is being executed sequentially (by a single thread with ID=0), but rather that the master of each of the 4 teams is reporting its own ID=0.

If you enable nested parallelism, you will have a flow like the one shown in the image below:

[Image: execution flow with nested parallelism enabled]

The execution of threads 1 to 3 was omitted for simplicity's sake; nonetheless, it would have been the same as that of thread 0.

So, from the first parallel region, a team of 4 threads is created. Upon encountering the next parallel region, each thread of that team creates a new team of 4 threads, so at that point there are 16 threads in total across 4 teams. Finally, each team executes the entire for loop. However, because you have a #pragma omp parallel for construct, the iterations of the for loop are divided among the threads within each team.

Bear in mind that in the image above I am assuming a certain static distribution of loop iterations among threads; I am not implying that the loop iterations will always be divided like this across all implementations of the OpenMP standard.
