OpenMP overhead


Problem description

I have parallelized image convolution and LU factorization using OpenMP and Intel TBB, and I am testing them on 1-8 cores. When I restrict the run to a single core, by specifying one thread with set_num_threads(1) in OpenMP and task_scheduler_init InitTBB(1) in TBB, the TBB version shows a small slowdown compared to the sequential code due to TBB overhead, but surprisingly OpenMP shows no overhead at all on a single core and performs exactly the same as the sequential code (using Intel's O3 optimization level). I am using static scheduling for the OpenMP loops. Is this realistic, or am I making a mistake somewhere?
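For context, a minimal sketch of the kind of single-thread comparison described (with a made-up kernel and sizes, not the asker's actual convolution/LU code): the same loop is timed sequentially and then under OpenMP with the thread count forced to 1 and static scheduling.

```cpp
// Hypothetical single-thread comparison: sequential loop vs. the same loop
// under OpenMP with one thread and schedule(static).
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const int n = 1 << 22;
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);

    // Sequential baseline.
    double t0 = omp_get_wtime();
    for (int i = 0; i < n; ++i)
        c[i] = a[i] * b[i] + a[i];
    double t_seq = omp_get_wtime() - t0;

    // OpenMP restricted to one thread, static schedule as in the question.
    omp_set_num_threads(1);
    t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i)
        c[i] = a[i] * b[i] + a[i];
    double t_omp = omp_get_wtime() - t0;

    std::printf("sequential: %.4f s, OpenMP (1 thread): %.4f s\n", t_seq, t_omp);
    return 0;
}
```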

Recommended answer

The OpenMP runtime will probably not create any threads if you run it with just one thread.
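A quick way to see this (assuming a compiler with OpenMP support) is to print the team size inside a parallel region after requesting one thread; the region is then executed by the initial thread alone, so no worker threads need to be spawned.

```cpp
// Small check: with one thread requested, the "team" consists only of the
// initial thread.
#include <cstdio>
#include <omp.h>

int main() {
    omp_set_num_threads(1);
    #pragma omp parallel
    {
        // Expected output: "team size: 1, this thread: 0", printed once.
        std::printf("team size: %d, this thread: %d\n",
                    omp_get_num_threads(), omp_get_thread_num());
    }
    return 0;
}
```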

Also, just using OpenMP parallelization directives sometimes makes serial code run faster as well, because you are essentially giving the compiler more information. A work-sharing construct, for example, tells the compiler that the iterations of the loop are independent of each other, which it might not have been able to deduce on its own, and this allows the compiler to use more aggressive optimization strategies. Not always, of course, but I have seen it happen with "real world code".
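As a hypothetical illustration (not the asker's code): without extra information the compiler must assume the pointers below may alias, which can limit optimizations such as vectorization, whereas the work-sharing directive asserts that the iterations are independent. Whether a particular compiler exploits this is implementation-dependent.

```cpp
#include <cstdio>
#include <vector>
#include <omp.h>

// Plain loop: possible aliasing between dst and src constrains the compiler.
void scale(float* dst, const float* src, int n, float s) {
    for (int i = 0; i < n; ++i)
        dst[i] = s * src[i];
}

// The directive promises independent iterations, which may enable more
// aggressive optimization even when the loop runs on a single thread.
void scale_omp(float* dst, const float* src, int n, float s) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i)
        dst[i] = s * src[i];
}

int main() {
    std::vector<float> src(8, 2.0f), dst(8, 0.0f);
    scale_omp(dst.data(), src.data(), 8, 1.5f);
    std::printf("dst[0] = %.1f\n", dst[0]);  // expected: 3.0
    return 0;
}
```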
