Shared library bottleneck on a NUMA machine

Question

I'm using a NUMA machine (an SGI UV 1000) to run a large number of numerical simulations at the same time, each of which is an OpenMP job using 4 cores. However, running more than around 100 of these jobs results in a significant performance hit. Our theory as to why this happens is that the shared libraries required by the software are loaded only once into the machine's global memory, and the system is then experiencing a communication bottleneck as all processes are accessing memory on a single node.

It is old software with little to no scope for modification, and its static make option does not statically link all the libraries it needs. The most convenient solution, from what I can see, would be to somehow force the system to load a new copy of the required shared libraries for each process or on each node (I am running 3 processes per node), but I haven't been able to find out how to do this. Can anyone tell me how to do this, or suggest another way to solve this problem?

Recommended answer

the shared libraries required by the software are loaded only once into the machine's global memory,

As far as I know, this is the current behavior of Linux. A shared library is loaded into only one set of physical memory pages, and those pages reside on a single node.

and the system is then experiencing a communication bottleneck as all processes are accessing memory on a single node.

As noted in the comments, instructions from the library should be cached in each processor, so a bottleneck can arise only if the library's active code is evicted from the cache (for example, when a large amount of different code is executing).

You should verify your theory using hardware performance counters (cache misses, inter-node NUMA memory access counts).
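Alongside the hardware counters, it may help to check where the library pages actually live. The sketch below is not a performance-counter measurement; it is a complementary check that assumes a Linux kernel built with NUMA support, which exposes per-mapping page placement in /proc/<pid>/numa_maps. The function names `pages_per_node` and `read_numa_maps` are hypothetical helpers written for this illustration.

```python
import re
from collections import defaultdict

# Fields such as "N0=280" give the number of pages resident on node 0.
NODE_FIELD = re.compile(r"^N(\d+)=(\d+)$")


def pages_per_node(numa_maps_text):
    """Sum pages per NUMA node over file-backed shared-library mappings.

    Only mappings whose file name contains ".so" are counted
    (a simple heuristic for shared libraries, e.g. libc.so.6).
    """
    totals = defaultdict(int)
    for line in numa_maps_text.splitlines():
        fields = line.split()
        path = next((f[5:] for f in fields if f.startswith("file=")), None)
        if path is None or ".so" not in path.rsplit("/", 1)[-1]:
            continue  # anonymous memory, stack, heap, or a non-library file
        for field in fields:
            match = NODE_FIELD.match(field)
            if match:
                totals[int(match.group(1))] += int(match.group(2))
    return dict(totals)


def read_numa_maps(pid="self"):
    """Read the raw numa_maps text for a process (Linux with NUMA support only)."""
    with open(f"/proc/{pid}/numa_maps") as handle:
        return handle.read()
```

Calling `pages_per_node(read_numa_maps(pid))` for a few of the simulation processes on different nodes would show whether their library pages all sit on the same node. If they do, that is consistent with the theory, although only the counters can show whether those remote accesses are actually frequent.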

On Linux, the mechanism of keeping several copies of the same data on a NUMA system is called "replication". The code of the kernel, of an executable, or of its shared libraries is called "text". So what you want is "text replication for shared libraries". I think text replication is easier to implement for kernel code.

I was able to find some experimental patches from 2003 that implement this kind of text replication, e.g. http://lwn.net/Articles/63512/ ([RFC][PATCH] NUMA user page replication) by Dave Hansen, IBM. This patch appears to have been rejected.

A more recent (2007) variant of this technique is replication of the page cache: http://lwn.net/Articles/223056/ (mm: replicated pagecache) by Nick Piggin, SUSE. There is also a presentation about his method: http://ondioline.org/~paul/pagecachereplication.pdf. This would work because all files, both executables and shared libraries, are stored in the page cache. However, I could not find this patch in the current kernel either.

SGI has a greater need for replication (they have more NUMA machines than a typical kernel developer), so there may be some additional patches. SGI's application tuning manual for NUMA, http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/linux/bks/SGI_Developer/books/LX_86_AppTune/sgi_html/ch05.html, mentions the dplace utility in the section "Using the dplace Command". It has an option for text replication:

-r: Specifies that text should be replicated on the node or nodes where the application is running. In some cases, replication will improve performance by reducing the need to make offnode memory references for code. The replication option applies to all programs placed by the dplace command. See the dplace(5) man page for additional information on text replication. The replication options are a string of one or more of the following characters:

l Replicate library text

b Replicate binary (a.out) text

t Thread round-robin option
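As a rough illustration of how the option described above might be applied to each simulation, the sketch below builds a dplace command line in Python. It assumes the replication string is passed as a separate argument after -r (check dplace(1) on the target system); the helper `dplace_argv`, the program name `./simulation` and its argument `input.dat` are hypothetical.

```python
import shlex

# Replication characters listed in the manual excerpt above.
VALID_REPLICATION_FLAGS = set("lbt")


def dplace_argv(command, replicate="lb"):
    """Return an argv list that runs `command` under dplace with text replication.

    `replicate` is a string of one or more of 'l' (library text),
    'b' (binary text) and 't' (thread round-robin option).
    """
    if not replicate or not set(replicate) <= VALID_REPLICATION_FLAGS:
        raise ValueError(f"invalid replication string: {replicate!r}")
    return ["dplace", "-r", replicate, *command]


# Print the command line for one job; pass the list to subprocess.run to launch it.
print(shlex.join(dplace_argv(["./simulation", "input.dat"])))
# prints: dplace -r lb ./simulation input.dat
```

Replicating both library and binary text ("lb") is only a starting point here; whether it helps should be judged by rerunning the counter measurements with and without it.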

Man dplace(1): http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=man&fname=/usr/share/catman/man1/dplace.1.html

Man dplace(5): http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=man&fname=/usr/share/catman/man5/dplace.5.html
