Persistence of OpenMP thread teams across functions
Question
I have a simple program that I am using for physics simulation. I want to know how to implement a certain threading paradigm in OpenMP.
int main()
{
    #define steps (100000)
    for (int t = 0; t < steps; t++)
    {
        firstParallelLoop();
        secondParallelLoop();
        if (!(t%100))
        {
            checkpoint();
        }
    }
}

void firstParallelLoop()
{   // In another file.c
    #pragma omp parallel for
    for (int i = 0; i < sizeOfSim; i++)
    {
        // Some atomic floating point ops.
    }
}
Formerly, I was using pthreads and got a 1.7x speedup on my dual-core laptop. I can't seem to get any speedup when using OpenMP. I suspect the problem is that the thread groups/pools are rapidly being created and destroyed, with disastrous effect.
In my pthreads implementation I needed to ensure that no new threads were created and that my program behaved as a client-server. In the pthreads scheme, main() was the server, and calls to firstParallelLoop would release mutexes/semaphores that triggered the threads to reprocess the data.
When I look at CPU utilization I expect it to be over the 30% mark (4 core, 2 are HT), but it stays around 27...
How do I get OpenMP to do something similar? How can I tell OpenMP to reuse my threads?
Solution
The GCC OpenMP run-time libgomp implements thread teams on POSIX systems with something akin to a thread pool: threads are created only when the first parallel region is encountered, and each thread then runs an infinite work loop. Entering and exiting a parallel region is implemented with barriers. By default libgomp uses a combination of busy-waiting and sleeping to implement barriers. The amount of busy-waiting is controlled by the OMP_WAIT_POLICY environment variable. If it is not specified, threads that wait on a barrier busy-wait for 300000 spins (3 ms at 100000 spins/msec) and then go to sleep. If OMP_WAIT_POLICY is set to active, the busy-wait time is increased to 30000000000 spins (about 5 minutes at 100000 spins/msec). You can fine-tune the busy-waiting time by setting the GOMP_SPINCOUNT variable to the number of busy cycles (libgomp assumes about 100000 spins/msec, but this can vary by a factor of 5 depending on the CPU). You can fully disable sleeping like this:
OMP_WAIT_POLICY=active GOMP_SPINCOUNT=infinite OMP_NUM_THREADS=... ./program
This somewhat improves the thread team start-up time, but at the expense of CPU time, since idle threads no longer sleep but rather busy-wait.
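If you want to see that effect in numbers, one rough approach (this micro-benchmark is only an illustrative sketch, not part of the original question or answer) is to time a loop of empty parallel regions once with the default policy and once with OMP_WAIT_POLICY=active:

/* overhead.c - hypothetical micro-benchmark: times empty parallel regions,
 * whose cost is dominated by team wake-up and the implicit end-of-region barrier.
 * Build: gcc -O2 -fopenmp overhead.c -o overhead
 * Run:   ./overhead                         (default wait policy)
 *        OMP_WAIT_POLICY=active ./overhead  (pure busy-waiting)
 */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int reps = 10000;

    double t0 = omp_get_wtime();
    for (int r = 0; r < reps; r++)
    {
        #pragma omp parallel
        {
            // Empty region: only fork/join and barrier overhead remain.
        }
    }
    double t1 = omp_get_wtime();

    printf("average overhead per parallel region: %.2f us\n",
           (t1 - t0) * 1e6 / reps);
    return 0;
}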
In order to remove the overhead you should rewrite your program in a more OpenMP-friendly way. Your example code could be rewritten like this:
int main()
{
    #define steps (100000)
    #pragma omp parallel
    {
        for (int t = 0; t < steps; t++)
        {
            firstParallelLoop();
            secondParallelLoop();
            if (!(t%100))
            {
                #pragma omp master
                checkpoint();
                #pragma omp barrier
            }
        }
    }
}

void firstParallelLoop()
{   // In another file.c
    #pragma omp for
    for (int i = 0; i < sizeOfSim; i++)
    {
        // Some atomic floating point ops.
    }
}
Note the following two things:
- A parallel region is inserted in the main program. It is not a parallel for, though. All threads in the team execute the outer loop steps times.
- The for loop in firstParallelLoop is made parallel by using omp for only. Thus it executes as a serial loop when called outside an OpenMP parallel region and as a parallel loop when called from inside one. The same should be done for the loop in secondParallelLoop (see the sketch at the end of this answer).
The barrier in the main loop ensures that the other threads wait for the checkpoint to finish before starting the next iteration.
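For completeness, here is a minimal sketch of how secondParallelLoop could be adapted in the same way; its loop bound (sizeOfSim) and body are assumptions mirrored from firstParallelLoop, since the original question never shows them:

void secondParallelLoop()
{   // In another file.c - hypothetical body, mirroring firstParallelLoop
    #pragma omp for
    for (int i = 0; i < sizeOfSim; i++)
    {
        // Some atomic floating point ops.
    }
}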