Persistence of OpenMP thread teams across functions


Problem description

I have a simple program that I am using for a physics simulation. I want to know how to implement a certain threading paradigm in OpenMP.

int main()
{
#define steps (100000)
   for (int t = 0; t < steps; t++)
   {
     firstParallelLoop();
     secondParallelLoop();
     if (!(t%100))
     {
        checkpoint();
     }
   }
}
void firstParallelLoop()
{// In another file.c
  #pragma omp parallel for
   for (int i = 0; i < sizeOfSim; i++)
   {
     //Some atomic floating point ops.
   }
}

Formerly, I was using pthreads and got a 1.7x speedup on my dual-core laptop. I can't seem to get any speedup when using OpenMP. I suspect the problem is that the thread groups/pools are rapidly being created and destroyed, with disastrous effect.

In my pthreads implementation I needed to ensure that no new threads were created and that my program behaved as a client-server. In the pthreads scheme, main() was the server, and calls to firstParallelLoop would release mutexes/semaphores that triggered the threads to reprocess the data.
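
For illustration, here is a minimal sketch of the kind of client-server pthreads pattern described above (this is not the original code; N_WORKERS, workerMain, and simStep are hypothetical names standing in for the real ones):

#include <pthread.h>
#include <semaphore.h>

#define N_WORKERS 2

static sem_t start[N_WORKERS];   /* server posts these to release the workers  */
static sem_t done;               /* workers post this when a pass finishes     */
static int   shutting_down = 0;

static void simStep(int id)      /* stand-in for one worker's share of the work */
{
   (void)id; /* ... process this worker's slice of the simulation ... */
}

static void *workerMain(void *arg)
{
   int id = (int)(long)arg;
   for (;;) {                    /* infinite work loop: the thread is reused   */
      sem_wait(&start[id]);      /* block until the server releases us         */
      if (shutting_down)
         break;
      simStep(id);
      sem_post(&done);           /* report completion back to the server       */
   }
   return NULL;
}

int main(void)
{
   pthread_t tid[N_WORKERS];
   sem_init(&done, 0, 0);
   for (int i = 0; i < N_WORKERS; i++) {
      sem_init(&start[i], 0, 0);
      pthread_create(&tid[i], NULL, workerMain, (void *)(long)i);
   }

   for (int t = 0; t < 100; t++) {          /* main() acts as the server  */
      for (int i = 0; i < N_WORKERS; i++)   /* release all workers        */
         sem_post(&start[i]);
      for (int i = 0; i < N_WORKERS; i++)   /* wait for all to finish     */
         sem_wait(&done);
   }

   shutting_down = 1;                       /* tell the workers to exit   */
   for (int i = 0; i < N_WORKERS; i++)
      sem_post(&start[i]);
   for (int i = 0; i < N_WORKERS; i++)
      pthread_join(tid[i], NULL);
   return 0;
}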

When I look at CPU utilization I expect it to be over the 30% mark (4 cores, 2 of which are HT), but it stays around 27...

How do I get OpenMP to do something similar? How can I tell OpenMP to reuse my threads?

Solution

The GCC OpenMP run-time libgomp implements thread teams on POSIX systems with something akin to a thread pool: threads are only created when the first parallel region is encountered, and each thread then runs an infinite work loop. Entering and exiting a parallel region is implemented with barriers. By default libgomp uses a combination of busy-waiting and sleeping to implement barriers. The amount of busy-waiting is controlled by the OMP_WAIT_POLICY environment variable. If it is not specified, threads that wait on a barrier busy-wait for 300000 spins (3 ms at 100000 spins/msec) and then go to sleep. If OMP_WAIT_POLICY is set to active, the busy-wait time is increased to 30000000000 spins (5 min at 100000 spins/msec). You can fine-tune the busy-waiting time by setting the GOMP_SPINCOUNT variable to the number of busy cycles (libgomp assumes about 100000 spins/msec, but this can vary by a factor of 5 depending on the CPU). You can fully disable sleeping like this:

OMP_WAIT_POLICY=active GOMP_SPINCOUNT=infinite OMP_NUM_THREADS=... ./program

This somewhat improves the thread-team start-up time, but at the expense of CPU time, since idle threads no longer sleep but busy-wait instead.

In order to remove the overhead you should rewrite your program in a more OpenMP-friendly way. Your example code could be rewritten like this:

int main()
{
#define steps (100000)
   #pragma omp parallel
   {
      for (int t = 0; t < steps; t++)
      {
         firstParallelLoop();
         secondParallelLoop();
         if (!(t%100))
         {
            #pragma omp master
            checkpoint();          // only the master thread checkpoints
            #pragma omp barrier    // all other threads wait here
         }
      }
   }
}
void firstParallelLoop()
{// In another file.c
   #pragma omp for
   for (int i = 0; i < sizeOfSim; i++)
   {
      //Some atomic floating point ops.
   }
}

Note the following two things:

  • A parallel region is inserted in the main program. It is not a parallel for, though. All threads in the team execute the outer loop steps times.
  • The for loop in firstParallelLoop is made parallel by using omp for only. Thus it executes as a serial loop when called outside a parallel region and as a parallel worksharing loop when called from inside one. The same should be done for the loop in secondParallelLoop (see the sketch below).
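
For completeness, secondParallelLoop would then look analogous to firstParallelLoop (a sketch; the loop body is a placeholder carried over from the question):

void secondParallelLoop()
{// In another file.c
   #pragma omp for
   for (int i = 0; i < sizeOfSim; i++)
   {
      //The second pass over the data, again an orphaned omp for.
   }
}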

The barrier in the main loop ensures that the other threads wait for the checkpoint to finish before starting the next iteration.
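
As a side note (not part of the original answer), the master/barrier pair could equivalently be written with a single construct, which carries an implicit barrier at its end:

         if (!(t%100))
         {
            #pragma omp single
            checkpoint();   // run by exactly one thread; implicit barrier follows
         }

Which form to use is mostly a matter of taste: single lets whichever thread arrives first do the checkpoint, while master pins it to thread 0.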
