Why would parallelization decrease performance so dramatically?


Problem description


I have an OpenMP program (thousands of lines, impossible to reproduce here) that works as follows:

It consists of worker threads along with a task queue.
A task consists of a convolution; every time a worker thread pops off a task from the work queue, it performs the required convolution and optionally pushes more convolutions onto the queue.
(There is no specific "master" thread; all workers are equal.)

When I run this program on my own machine (4-core HT non-NUMA Core i7), the running times I get are:

(#threads: running time)
 1: 5374 ms
 2: 2830 ms
 3: 2147 ms
 4: 1723 ms
 5: 1379 ms
 6: 1281 ms
 7: 1217 ms
 8: 1179 ms

This makes sense.

However, when I run it on a NUMA 48-core AMD Opteron 6168 machine, I get these running times:

 1: 9252 ms
 2: 5101 ms
 3: 3651 ms
 4: 2821 ms
 5: 2364 ms
 6: 2062 ms
 7: 1954 ms
 8: 1725 ms
 9: 1564 ms
10: 1513 ms
11: 1508 ms
12: 1796 ms  <------ why did it get worse?
13: 1718 ms
14: 1765 ms
15: 2799 ms  <------ why did it get *so much* worse?
16: 2189 ms
17: 3661 ms
18: 3967 ms
19: 4415 ms
20: 3089 ms
21: 5102 ms
22: 3761 ms
23: 5795 ms
24: 4202 ms

These results are quite consistent; they are not an artifact of load on the machine.
So I don't understand:
What could cause the performance to drop so much after 12 cores?

I would understand if the performance saturated at some level (I could blame it on limited memory bandwidth), but I don't understand how it can drop from 1508 ms to 5795 ms by adding more threads.

How is this possible?

Solution

These sorts of situations can be quite hard to figure out. One key is to look at memory locality. Without seeing your code, it's impossible to say EXACTLY what is going wrong, but we can discuss some of the things that make "multithreading less good":

In any NUMA system, when memory is located with processor X but the code is running on processor Y (where X and Y are not the same processor), every memory access is bad for performance. So allocating memory on the right NUMA node will certainly help. (This may require some special code, such as setting affinity masks, or at least hinting to the OS/runtime system that you want NUMA-aware allocations.) At the very least, make sure you don't simply work on one large array that was allocated by the first thread before the other threads were started.

Another thing that is even worse is sharing, or false sharing, of memory: if two or more processors use the same cache line, you get a ping-pong match between them, where each processor in turn says "I want the memory at address A", grabs the cache line, updates it, and then the next processor does the same thing.

The fact that the results get bad right at 12 threads seems to indicate that it has to do with sockets: either you are sharing data, or the data is located on the wrong node. At 12 threads, you likely start using the second socket (more), which makes these sorts of problems more apparent.

For best performance, you need memory allocated on the local node, no sharing, and no locking. Your first set of results also doesn't look "ideal". I have some (completely share-nothing) code that scales exactly n-fold with the number of processors, until I run out of processors (unfortunately, my machine has only 4 cores, so it's "only" 4x better than a single core; but if I ever got my hands on a 48- or 64-core machine, it would produce results 48 or 64 times faster at calculating "weird numbers").

Edit:

The "Socket issue" is two things:

  1. Memory locality: memory is physically attached to each socket, so if memory was allocated from a region belonging to another socket, every read of that memory incurs extra latency.

  2. Cache/sharing: within a processor there are fast links for sharing data (often a bottom-level shared cache, e.g. the L3 cache), which allow the cores within a socket to share data more efficiently than cores in different sockets can.

All this amounts to something like servicing cars without your own toolbox: every time you need a tool, you have to ask the colleague next to you for a screwdriver, a 15 mm spanner, or whatever it is, and then hand the tools back when your work area gets a bit full. It's not a very efficient way of working. It would be much better to have your own tools (at least the most common ones: one of those special spanners you use only once a month isn't a big issue, but your everyday 10, 12 and 15 mm spanners and a few screwdrivers, for sure). And of course, it gets even worse if four mechanics all share the same toolbox. This is the case where all memory is allocated on one node in a four-socket system.

Now imagine that you have a box of spanners, and only one mechanic can use the box at a time: if you need the 12 mm spanner, you have to wait for the guy next to you to finish with the 15 mm spanner. This is what happens with false cache-sharing: the processors aren't really using the same value, but because there is more than one "thing" in the cache line, they end up sharing the cache line (the box of spanners).
