OpenMP: for schedule


Problem Description

I'm working with OpenMP in C++ and I have some questions:

1) What is #pragma omp for schedule?
2) What is the difference between dynamic and static?

Do these commands schedule threads so that one works, ends, and another starts? I have already searched on the internet, but I can't understand what these commands do. And please, give an explained example.

Thanks.

Solution

Others have since answered most of the question, but I would like to point to some specific cases where a particular scheduling type is better suited than the others. schedule controls how loop iterations are divided among threads. Choosing the right schedule can have a great impact on the speed of the application.

static schedule means that iteration blocks are mapped statically to the execution threads in a round-robin fashion. The nice thing about static scheduling is that the OpenMP runtime guarantees that if you have two separate loops with the same number of iterations and execute them with the same number of threads using static scheduling, then each thread will receive exactly the same iteration range(s) in both parallel regions. This is quite important on NUMA systems: if you touch some memory in the first loop, it will reside on the NUMA node where the executing thread was. Then in the second loop the same thread can access the same memory location faster since it will reside on the same NUMA node.
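
To make that guarantee concrete, here is a minimal toy sketch (in the same spirit as the dynamic-schedule test further down) that prints the iteration-to-thread mapping of two identically scheduled loops; with schedule(static,1) both loops always report the same mapping:

#include <stdio.h>
#include <omp.h>

int main (void)
{
  #pragma omp parallel num_threads(4)
  {
    /* first loop: with chunk size 1, iteration i goes to thread i % 4 */
    #pragma omp for schedule(static,1)
    for (int i = 0; i < 8; i++)
      printf("[1] iter %d, tid %d\n", i, omp_get_thread_num());

    /* second loop: guaranteed to get the very same distribution */
    #pragma omp for schedule(static,1)
    for (int i = 0; i < 8; i++)
      printf("[2] iter %d, tid %d\n", i, omp_get_thread_num());
  }
  return 0;
}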

Imagine there are two NUMA nodes: node 0 and node 1, e.g. a two-socket Intel Nehalem board with 4-core CPUs in both sockets. Then threads 0, 1, 2, and 3 will reside on node 0 and threads 4, 5, 6, and 7 will reside on node 1:

|             | core 0 | thread 0 |
| socket 0    | core 1 | thread 1 |
| NUMA node 0 | core 2 | thread 2 |
|             | core 3 | thread 3 |

|             | core 4 | thread 4 |
| socket 1    | core 5 | thread 5 |
| NUMA node 1 | core 6 | thread 6 |
|             | core 7 | thread 7 |

Each core can access memory from each NUMA node, but remote access is slower (1.5x - 1.9x slower on Intel) than local node access. You run something like this:

char *a = (char *)malloc(8*4096);   /* needs <stdlib.h>; memset needs <string.h> */

#pragma omp parallel for schedule(static,1) num_threads(8)
for (int i = 0; i < 8; i++)
   memset(&a[i*4096], 0, 4096);     /* each thread first-touches its own 4 KiB page */

4096 bytes in this case is the standard size of one memory page on Linux on x86 when huge pages are not used. This code will zero the whole 32 KiB array a. The malloc() call just reserves virtual address space but does not actually "touch" the physical memory (this is the default behaviour, unless some other version of malloc is used, e.g. one that zeroes the memory like calloc() does). Now this array is contiguous, but only in virtual memory. In physical memory, half of it would lie in the memory attached to socket 0 and half in the memory attached to socket 1. This is so because different parts are zeroed by different threads, those threads reside on different cores, and there is something called the first-touch NUMA policy, which means that memory pages are allocated on the NUMA node where the thread that first "touched" the page resides.

|             | core 0 | thread 0 | a[0]     ... a[4095]
| socket 0    | core 1 | thread 1 | a[4096]  ... a[8191]
| NUMA node 0 | core 2 | thread 2 | a[8192]  ... a[12287]
|             | core 3 | thread 3 | a[12288] ... a[16383]

|             | core 4 | thread 4 | a[16384] ... a[20479]
| socket 1    | core 5 | thread 5 | a[20480] ... a[24575]
| NUMA node 1 | core 6 | thread 6 | a[24576] ... a[28671]
|             | core 7 | thread 7 | a[28672] ... a[32767]

Now let's run another loop like this:

#pragma omp parallel for schedule(static,1) num_threads(8)
for (int i = 0; i < 8; i++)
   memset(&a[i*4096], 1, 4096);

Each thread will access the already mapped physical memory, and the mapping of threads to memory regions will be the same as during the first loop. This means that threads will only access memory located in their local memory blocks, which will be fast.

Now imagine that another scheduling scheme is used for the second loop: schedule(static,2). This will "chop" the iteration space into blocks of two iterations each, and there will be 4 such blocks in total. What will happen is that we will have the following thread-to-memory-location mapping (via the iteration number):

|             | core 0 | thread 0 | a[0]     ... a[8191]  <- OK, same memory node
| socket 0    | core 1 | thread 1 | a[8192]  ... a[16383] <- OK, same memory node
| NUMA node 0 | core 2 | thread 2 | a[16384] ... a[24575] <- Not OK, remote memory
|             | core 3 | thread 3 | a[24576] ... a[32767] <- Not OK, remote memory

|             | core 4 | thread 4 | <idle>
| socket 1    | core 5 | thread 5 | <idle>
| NUMA node 1 | core 6 | thread 6 | <idle>
|             | core 7 | thread 7 | <idle>

Two bad things happen here:

  • threads 4 to 7 remain idle and half of the compute capability is lost;
  • threads 2 and 3 access non-local memory and it will take them about twice as much time to finish during which time threads 0 and 1 will remain idle.

So one of the advantages of using static scheduling is that it improves locality of memory access. The disadvantage is that a bad choice of scheduling parameters can ruin the performance.

dynamic scheduling works on a "first come, first served" basis. Two runs with the same number of threads might (and most likely would) produce completely different "iteration space" -> "threads" mappings as one can easily verify:

$ cat dyn.c
#include <stdio.h>
#include <omp.h>

int main (void)
{
  int i;

  #pragma omp parallel num_threads(8)
  {
    #pragma omp for schedule(dynamic,1)
    for (i = 0; i < 8; i++)
      printf("[1] iter %0d, tid %0d\n", i, omp_get_thread_num());

    #pragma omp for schedule(dynamic,1)
    for (i = 0; i < 8; i++)
      printf("[2] iter %0d, tid %0d\n", i, omp_get_thread_num());
  }

  return 0;
}

$ icc -openmp -o dyn.x dyn.c

$ OMP_NUM_THREADS=8 ./dyn.x | sort
[1] iter 0, tid 2
[1] iter 1, tid 0
[1] iter 2, tid 7
[1] iter 3, tid 3
[1] iter 4, tid 4
[1] iter 5, tid 1
[1] iter 6, tid 6
[1] iter 7, tid 5
[2] iter 0, tid 0
[2] iter 1, tid 2
[2] iter 2, tid 7
[2] iter 3, tid 3
[2] iter 4, tid 6
[2] iter 5, tid 1
[2] iter 6, tid 5
[2] iter 7, tid 4

(same behaviour is observed when gcc is used instead)
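
For reference, the same test builds with GCC instead of the Intel compiler like this:

$ gcc -fopenmp -o dyn.x dyn.c
$ OMP_NUM_THREADS=8 ./dyn.x | sort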

If the sample code from the static section were run with dynamic scheduling instead, there would be only a 1/70 (1.4%) chance that the original locality is preserved and a 69/70 (98.6%) chance that remote access occurs: there are 8!/(4!·4!) = 70 equally likely ways to distribute the eight iterations over the two sockets, and only one of them puts every iteration back on the node that first touched its page. This fact is often overlooked, and hence suboptimal performance is achieved.

There is another reason to choose between static and dynamic scheduling: workload balancing. If each iteration takes a time that differs vastly from the mean, then high work imbalance might occur in the static case. Take as an example the case where the time to complete an iteration grows linearly with the iteration number. If the iteration space is divided statically between two threads, the second one will have three times more work than the first, and hence for 2/3 of the compute time the first thread will be idle. Dynamic scheduling introduces some additional overhead but in that particular case leads to a much better workload distribution. A special kind of dynamic scheduling is guided scheduling, where smaller and smaller iteration blocks are given to each task as the work progresses.
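
As an illustration, here is a minimal sketch of such a triangular workload; the work() function and the loop bounds are made-up placeholders, but swapping the schedule clause between static, dynamic and guided shows the balancing effect:

#include <stdio.h>
#include <omp.h>

/* placeholder workload whose cost grows linearly with i */
static double work (int i)
{
  double s = 0.0;
  for (int k = 0; k < 1000 * i; k++)
    s += 0.5 * k;
  return s;
}

int main (void)
{
  double sum = 0.0;
  double t0 = omp_get_wtime();

  /* try schedule(static), schedule(dynamic,16), schedule(guided) */
  #pragma omp parallel for schedule(dynamic,16) reduction(+:sum)
  for (int i = 0; i < 2000; i++)
    sum += work(i);

  printf("sum = %g, time = %f s\n", sum, omp_get_wtime() - t0);
  return 0;
}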

Since precompiled code can be run on various platforms, it is nice if the end user can control the scheduling. That's why OpenMP provides the special schedule(runtime) clause. With runtime scheduling, the type is taken from the content of the environment variable OMP_SCHEDULE. This allows testing different scheduling types without recompiling the application and also lets the end user fine-tune for his or her platform.
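
For example, with a loop compiled once as

#pragma omp parallel for schedule(runtime)
for (int i = 0; i < n; i++)
    do_work(i);

the schedule can then be changed from the shell without recompiling (do_work and ./app are placeholders here):

$ OMP_SCHEDULE="static,16" ./app
$ OMP_SCHEDULE="dynamic,4" ./app
$ OMP_SCHEDULE="guided" ./app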
