Producer-Consumer using OpenMP Tasks

This article describes how to handle a producer-consumer pattern using OpenMP tasks. It may serve as a useful reference for readers facing the same problem.

Problem Description

I'm trying to implement a parallel algorithm using tasks within OpenMP. The parallel programming pattern is based on the producer-consumer idea, but since the consumer process is slower than the producer, I want to use a few producers and several consumers. The main idea is to create as many OS threads as there are producers, and then each of these will create tasks to be done in parallel (by the consumers). Every producer will be associated with a proportional number of consumers (i.e. numCheckers/numSeekers). I'm running the algorithm on an Intel dual-chip server with 6 cores per chip.

The thing is that when I use only one producer (seeker) and an increasing number of consumers (checkers), the performance decays very fast as the number of consumers grows (see table below), even though the correct number of cores are working at 100%. On the other hand, if I increase the number of producers, the average time decreases or at least stays stable, even with a proportional number of consumers.

It seems to me that all the improvement comes from the division of the input among the producers, and that the tasks only get in the way. But again, I don't have any explanation for the behaviour with one producer. Am I missing something in the OpenMP task logic? Am I doing something wrong?

-------------------------------------------------------------------------
|   producers   |   consumers   |   time        |
-------------------------------------------------------------------------
|       1       |       1       |   0.642935    |
|       1       |       2       |   3.004023    |
|       1       |       3       |   5.332524    |
|       1       |       4       |   7.222009    |
|       1       |       5       |   9.472093    |
|       1       |       6       |   10.372389   |
|       1       |       7       |   12.671839   |
|       1       |       8       |   14.631013   |
|       1       |       9       |   14.500603   |
|       1       |      10       |   18.034931   |
|       1       |      11       |   17.835978   |
-------------------------------------------------------------------------
|       2       |       2       |   0.357881    |
|       2       |       4       |   0.361383    |
|       2       |       6       |   0.362556    |
|       2       |       8       |   0.359722    |
|       2       |      10       |   0.358816    |
-------------------------------------------------------------------------

The main section of my code is as follows:

int main( int argc, char** argv) {

  // ... process the input (read from file, etc...)

  const char *buffer_start[numSeekers];
  int buffer_len[numSeekers];

  //populate these arrays dividing the input
  //I need to do this because I need to overlap the buffers for
  //correctness, so a simple parallel-for is not enough

  //Here is where I create the producers
  int num = 0;
  #pragma omp parallel for num_threads(numSeekers) reduction(+:num)
  for (int i = 0; i < numSeekers; i++) {
      num += seek(buffer_start[i], buffer_len[i]);
  }

  return num;
}

int seek(const char* buffer, int n){

  int num = 0;

  //assign the same number of consumers to each producer
  #pragma omp parallel num_threads(numCheckers/numSeekers) shared(num)
  {
    //only one thread per producer generates the tasks
    #pragma omp single
    {
      for (int pos = 0; pos < n; pos += STEP){
        if (condition(buffer[pos])){
          #pragma omp task shared(num)
          {
            //check() is a sequential function
            num += check(buffer[pos]);
          }
        }
      }
      #pragma omp taskwait
    }
  }
  return num;
}

Solution

The observed behaviour is due to the fact that you do not have nested parallel regions enabled. What happens is that in the first case you are actually experiencing the HUGE overhead of OpenMP tasks. This is most likely due to the fact that check() is not doing enough work compared to the overhead the OpenMP run-time introduces. Why does it behave like this with 1 and with 2 producers?

When running with just one producer, the outer parallel region is executing with one thread only. Such parallel regions are inactive according to the OpenMP API specification and they just execute the code inside serially (the only overhead being an additional function call and accessing shared variables via pointers). In this case the inner parallel region, although being nested while nested parallelism is disabled, becomes active and spurs a lot of tasks. Tasks introduce relatively high overhead and this overhead increases with the number of threads. With 1 consumer the inner parallel region is also inactive and hence runs serially without tasks overhead.

When running with two producers, the outer parallel region is active and hence the inner parallel region is rendered inactive (remember - no nested parallelism is enabled) and as a consequence no tasks are being created at all - seek() simply runs serially. There is no tasking overhead and the code runs almost twice as fast than the 1 producer / 1 consumer case. The run time is not dependent on the number of consumers because the inner parallel region is always inactive, no matter how many threads are specified.
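For completeness, a minimal sketch of how nested parallelism could be enabled so that the inner parallel region in seek() becomes active even when the outer region already runs with several producers (the values are placeholders; see also the lessons at the end):

#include <omp.h>

// somewhere early in main(), before any parallel region is entered
omp_set_nested(1);                // allow nested parallel regions
omp_set_max_active_levels(2);     // outer (producers) + inner (consumers), OpenMP 3.0+

// the same effect can be obtained without recompiling:
//   OMP_NESTED=true OMP_MAX_ACTIVE_LEVELS=2 ./program

Keep in mind that enabling nesting only removes the "inactive inner region" effect; the task overhead discussed below remains.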

How big is the overhead that tasking and concerted access to shared variables introduces? I've created a simple synthetic benchmark that executes the following code:

for (int i = 0; i < 10000000; i++) {
   ssum += sin(i*0.001);
}

Serially this executes for less than a second on a Westmere CPU with the default optimisation level of GCC 4.7.2. Then I've introduced tasks using simple atomic constructs to protect the access to the shared variable ssum:

#pragma omp parallel
{
   #pragma omp single
   for (int i = 0; i < 10000000; i++) {
      #pragma omp task
      {
         #pragma omp atomic
         ssum += sin(i*0.001);
      }
   }
}

(no need for taskwait here since there is a scheduling point at the implicit barrier at the end of the parallel region)

I've also created a more complex variant that performs reduction the same way that Massimiliano has proposed:

#define STRIDE 8

#pragma omp parallel
{
   #pragma omp single
   for (int i = 0; i < 10000000; i++) {
      #pragma omp task
      {
         const int idx = omp_get_thread_num();
         ssumt[idx*STRIDE] += sin(i*0.001);
      }
   }
   #pragma omp taskwait

   const int idx = omp_get_thread_num();
   #pragma omp atomic
   ssum += ssumt[idx*STRIDE];
}

Code was compiled with GCC 4.7.2 like:

g++ -fopenmp -o test.exe test.cc

Running it in batch mode (so no other processes could intervene) on a dual-socket Westmere system (12 cores in total) with different numbers of threads and different thread placements on the sockets, one obtains the following run times for both codes:

Configuration    ATOMIC        Reduction     ATOMIC slowdown
2 + 0             2.79 ±0.15    2.74 ±0.19    1.8%
1 + 1             2.72 ±0.21    2.51 ±0.22    8.4% <-----
6 + 0            10.14 ±0.01   10.12 ±0.01    0.2%
3 + 3            22.60 ±0.24   22.69 ±0.33   -0.4%
6 + 6            37.85 ±0.67   38.90 ±0.89   -2.7%

(run times are given in seconds as measured by omp_get_wtime(), averaged over 10 runs /std. deviation also shown/; x + y in the Configuration column means x threads on the first socket and y threads on the second socket)
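The exact mechanism used to pin the threads to the sockets is not shown here; with GCC's libgomp one possible way (an assumption on my part, not necessarily what was used for these measurements) is the GOMP_CPU_AFFINITY environment variable:

# 6 + 0: six threads on the first socket (core numbering is machine-specific)
GOMP_CPU_AFFINITY="0-5" OMP_NUM_THREADS=6 ./test.exe

# 3 + 3: three threads per socket, assuming cores 0-5 sit on socket 0 and 6-11 on socket 1
GOMP_CPU_AFFINITY="0-2 6-8" OMP_NUM_THREADS=6 ./test.exe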

As you can see, the overhead from the tasks is enormous. It is much higher than the overhead from using atomic instead of applying reduction to thread-private accumulators. Besides, the assignment part of an atomic with += compiles to a locked compare-and-exchange instruction (LOCK CMPXCHG) - not much higher overhead than calling omp_get_thread_num() each time.

One should also note that the dual-socket Westmere system is NUMA since each CPU has its own memory, and accesses to the memory of the other CPU go through the QPI link and hence have increased latency (and possibly lower bandwidth). As the ssum variable is shared in the atomic case, threads that run on the second processor are essentially making remote requests. Still, the difference between both codes is negligible (except for the marked two-threads case - I have to investigate why) and the atomic code even starts to outperform the one with the reduction when the number of threads gets higher.

On multiscale NUMA systems the synchronisation in the atomic approach might become more of a burden, since it adds locking overhead to the already slower remote accesses. One such system is one of our BCS-coupled nodes. BCS (Bull Coherence Switch) is a proprietary solution from Bull that uses XQPI (eXternal QPI) to link together several Nehalem-EX boards into a single system, introducing three levels of NUMA along the way (local memory; remote memory on the same board; remote memory on a remote board). When running on one such system, consisting of 4 boards with 4 octo-core Nehalem-EX CPUs each (128 cores in total), the atomic executable runs for 1036 s (!!), while the reduction approach runs for 1047 s, i.e. both still execute for about the same time (my previous statement that the atomic approach is 21.5% slower was due to OS services jitter during the measurement). Both numbers are from single runs and hence are not quite representative. Note that on this system the XQPI link introduces very high latency for inter-board QPI messages and thus locking is very expensive, but not that expensive. Part of the overhead can be taken away by using reduction, but it has to be implemented properly. First, local copies of the reduction variable should also be local to the NUMA node where the thread executes and, second, one should find a way not to make calls to omp_get_thread_num(). These two could be achieved in many different ways but the easiest one is just to use threadprivate variables:

static double ssumt;
#pragma omp threadprivate(ssumt)

#pragma omp parallel
{
   ssumt = 0.0;

   #pragma omp single
   for (int i = 0; i < 10000000; i++) {
      #pragma omp task
      {
         ssumt += sin(i*0.001);
      }
   }
   #pragma omp taskwait

   #pragma omp atomic
   ssum += ssumt;
}

Access to ssumt needs no protection as two tasks seldom execute simultaneously in the same thread (I have to further investigate if this conforms to the OpenMP specs). This version of the code executes for 972 s. Once again this is not that far from 1036 s and comes from a single measurement only (i.e. it could simply be a statistical fluctuation), but in theory it should be faster.
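As a side note, compilers that implement OpenMP 5.0 offer task reductions, which express the same idea without manual threadprivate accumulators; a minimal sketch (not available with the GCC 4.7.2 used above):

#pragma omp parallel
#pragma omp single
{
   // the taskgroup collects the partial results; each task contributes via in_reduction
   #pragma omp taskgroup task_reduction(+: ssum)
   for (int i = 0; i < 10000000; i++) {
      #pragma omp task in_reduction(+: ssum)
      ssum += sin(i*0.001);
   }
}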

Lessons to take home:

  • Read the OpenMP specification about nesting parallel regions. Nested parallelism is enabled either by setting the environment variable OMP_NESTED to true or by calling omp_set_nested(1). If enabled, the level of active nesting can be controlled via the OMP_MAX_ACTIVE_LEVELS environment variable, as pointed out by Massimiliano.
  • Look out for data races and try to prevent them using the simplest approach possible. A more complex approach does not always give you better performance.
  • Special systems usually require special programming.
  • When in doubt use a thread-checker tool if available. Intel has one (commercial) and Solaris Studio (previously known as Sun Studio) from Oracle has one too (free; has Linux version despite the product name).
  • Mind the overhead! Try to split the job into big enough chunks so that the overhead from having millions of tasks created does not negate the obtained parallel gain (a rough sketch applying this to the code from the question follows after this list).
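Applied to the code from the question, one rough way to follow the last advice would be to let each task check a whole block of positions instead of a single one, and to accumulate into a task-local variable so that there is only one synchronised update per task. BLOCK is a tuning knob introduced here purely for illustration; STEP, condition() and check() are the ones from the question:

int seek(const char* buffer, int n){
  int num = 0;
  #pragma omp parallel shared(num)
  #pragma omp single
  {
    const int BLOCK = 1024 * STEP;   // grain size - tune for the actual workload
    for (int start = 0; start < n; start += BLOCK){
      const int end = (start + BLOCK < n) ? start + BLOCK : n;
      #pragma omp task firstprivate(start, end)
      {
        int local = 0;               // per-task accumulator, no sharing inside the loop
        for (int pos = start; pos < end; pos += STEP)
          if (condition(buffer[pos]))
            local += check(buffer[pos]);
        #pragma omp atomic
        num += local;                // one atomic update per task instead of per position
      }
    }
    // the implicit barrier at the end of the parallel region also waits for the tasks
  }
  return num;
}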

This concludes the article on Producer-Consumer using OpenMP Tasks. We hope the answer above is helpful; thank you for your continued support of IT屋!
