Poor performance due to hyper-threading with OpenMP: how to bind threads to cores


Question

I am developing large dense matrix multiplication code. When I profile the code it sometimes gets about 75% of the peak flops of my four core system and other times gets about 36%. The efficiency does not change between executions of the code. It either starts at 75% and continues with that efficiency or starts at 36% and continues with that efficiency.

I have traced the problem down to hyper-threading and the fact that I set the number of threads to four instead of the default eight. When I disable hyper-threading in the BIOS I get about 75% efficiency consistently (or at least I never see the drastic drop to 36%).

Before I call any parallel code I do omp_set_num_threads(4). I have also tried export OMP_NUM_THREADS=4 before I run my code but it seems to be equivalent.

I don't want to disable hyper-threading in the BIOS. I think I need to bind the four threads to the four physical cores. I have tested several different GOMP_CPU_AFFINITY settings, but so far I still have the problem that the efficiency sometimes drops to 36%. What is the mapping between hyper-threads and cores? E.g. do thread 0 and thread 1 correspond to the same core, and thread 2 and thread 3 to another core?
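
To see where each OpenMP thread actually lands, a quick check is to print the current CPU from inside a parallel region. A minimal sketch, assuming Linux/glibc (sched_getcpu requires _GNU_SOURCE) and compilation with gcc -fopenmp:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_num_threads(4);  /* four threads, as in the question */
    #pragma omp parallel
    {
        /* sched_getcpu() returns the logical CPU the calling thread is
           running on right now; without binding, this can change between
           calls as the scheduler migrates threads. */
        printf("OpenMP thread %d on logical CPU %d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}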

How can I bind the threads to each core without thread migration so that I don't have to disable hyper-threading in the BIOS? Maybe I need to look into using sched_setaffinity?
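
For explicit pinning from inside the program, sched_setaffinity can be called once per thread at the start of the parallel region. A sketch, again assuming Linux/glibc; the thread-i-to-CPU-i mapping below assumes logical CPUs 0-3 sit on four distinct physical cores, which should be verified first:

#define _GNU_SOURCE
#include <sched.h>
#include <omp.h>

int main(void)
{
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        /* Build a CPU set containing exactly one logical CPU and bind the
           calling thread to it (pid 0 means the calling thread). */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(omp_get_thread_num(), &set);
        sched_setaffinity(0, sizeof(set), &set);

        /* ... matrix multiplication work goes here ... */
    }
    return 0;
}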

Some details of my current system: Linux kernel 3.13, GCC 4.8, Intel Xeon E5-1620 (four physical cores, eight hyper-threads).
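
The hyper-thread sibling layout can be read from sysfs rather than guessed. For example, on the E5-1620 above:

cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list

If this prints 0,4 (the typical enumeration on Intel systems, where logical CPUs 0-3 are the first hardware thread of each core and 4-7 their siblings), then binding four threads to CPUs 0-3 uses one hyper-thread per core; the layout should still be confirmed per machine, e.g. with lscpu -e, rather than assumed.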

Edit: This seems to be working well so far (GOMP_CPU_AFFINITY binds OpenMP threads in creation order to the listed CPUs, so with four threads the lists below put threads 0-3 on logical CPUs 0-3):

export GOMP_CPU_AFFINITY="0 1 2 3 4 5 6 7"

or

export GOMP_CPU_AFFINITY="0-7"

Edit: This seems also to work well

export OMP_PROC_BIND=true
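
OMP_PROC_BIND=true is the portable way to forbid thread migration. With compilers implementing OpenMP 4.0 (GCC 4.9 and later; the GCC 4.8 above predates it), the placement can be spelled out explicitly, assuming such a compiler is available:

export OMP_PLACES=cores
export OMP_PROC_BIND=close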

Edit: These options also work well (gemm is the name of my executable)

numactl -C 0,1,2,3 ./gemm

and

taskset -c 0,1,2,3 ./gemm

Solution

This isn't a direct answer to your question, but it might be worth looking into: apparently, hyper-threading can cause your cache to thrash. Have you tried checking with valgrind to see what kind of issue is causing your problem? There might be a quick fix to be had from allocating some junk at the top of every thread's stack so that your threads don't end up kicking each other's cache lines out.
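
For the cache-thrashing hypothesis, valgrind's cachegrind tool simulates the L1 and last-level caches and reports miss counts (gemm being the executable name from the question):

valgrind --tool=cachegrind ./gemm

And a hedged sketch of the stack-junk idea, staggering each thread's stack frame by a different number of cache lines (the 64-byte line size is an assumption to check against the actual hardware):

#include <alloca.h>
#include <omp.h>

void staggered_region(void)
{
    #pragma omp parallel
    {
        /* Offset each thread's stack by a different multiple of one cache
           line so hot stack data maps to different cache sets per thread.
           Touching pad[0] keeps the allocation from being optimized away. */
        volatile char *pad = alloca((omp_get_thread_num() + 1) * 64);
        pad[0] = 0;

        /* ... per-thread work ... */
    }
}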

It looks like your CPU is 4-way set associative, so it's not insane to think that, across 8 threads, you might end up with some really unfortunately aligned accesses. If your matrices are aligned on a multiple of the size of your cache, and if pairs of threads access areas a cache-multiple apart, any incidental read by a third thread would be enough to start causing conflict misses.

For a quick test, change your input matrices to a size that is not a multiple of your cache size (so they're no longer aligned on a boundary); if your problems disappear, then there's a good chance that you're dealing with conflict misses.
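
A sketch of that quick test, padding the leading dimension so consecutive rows no longer start at cache-size-aligned addresses (the pad of one 64-byte cache line, i.e. 8 doubles, is an arbitrary choice; any non-multiple would do):

#include <stdlib.h>

/* Allocate an n x n matrix of doubles with a padded leading dimension. */
double *alloc_padded(int n, int *lda)
{
    *lda = n + 8;  /* pad each row by 8 doubles = 64 bytes */
    return malloc((size_t)n * (size_t)*lda * sizeof(double));
}

Element (i, j) is then indexed as a[i * lda + j].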
