Shared library bottleneck on a NUMA machine


Question

I'm using a NUMA machine (an SGI UV 1000) to run a large number of numerical simulations at the same time, each of which is an OpenMP job using 4 cores. However, running more than around 100 of these jobs results in a significant performance hit. Our theory as to why this happens is that the shared libraries required by the software are loaded only once into the machine's global memory, and the system then hits a communication bottleneck because all processes are accessing memory on a single node.

It's an old piece of software with little to no scope for modification, and the static make option does not statically link all the libraries it needs. The most convenient solution, from what I can see, would be to somehow force the system to load a new copy of the required shared libraries for each process or on each node (I am running 3 processes per node), but I haven't been able to find out how to do this. Can anyone tell me how to do this, or suggest any other way to solve this problem?

Recommended answer

"the shared libraries required by the software are loaded only once into the machine's global memory,"

As far as I know, this is the current behavior of Linux. A shared library is loaded into only one set of physical memory pages, and those pages sit on a single node.

"and the system is then experiencing a communication bottleneck as all processes are accessing memory on a single node."

As noted in the comments, instructions from the library should be cached in every processor, so a bottleneck can arise only if the active library code is evicted from the cache (e.g. when a lot of different code is running).

You should verify your theory using hardware performance counters (cache misses, inter-node NUMA memory access counts).
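Alongside the counters, a quick sanity check is to look at which nodes actually hold a process's library pages. The following is a minimal sketch, assuming a Linux kernel that exposes /proc/<pid>/numa_maps with per-node page counts in `N<node>=<pages>` fields. The function names, the sample text, and the paths in it are hypothetical and only illustrate the file format.

```python
import re
from collections import defaultdict

# Per-node page counts appear in numa_maps as fields such as "N0=60".
NODE_FIELD = re.compile(r"^N(\d+)=(\d+)$")


def library_pages_per_node(numa_maps_text):
    """Return {library_path: {node_id: pages}} for mappings of .so files."""
    pages = defaultdict(lambda: defaultdict(int))
    for line in numa_maps_text.splitlines():
        fields = line.split()
        # File-backed mappings carry a "file=<path>" field.
        path = next(
            (f[len("file="):] for f in fields if f.startswith("file=")), None
        )
        if path is None or ".so" not in path:
            continue  # skip anonymous memory and the main binary
        for field in fields:
            match = NODE_FIELD.match(field)
            if match:
                pages[path][int(match.group(1))] += int(match.group(2))
    return {path: dict(nodes) for path, nodes in pages.items()}


def read_numa_maps(pid):
    """Read the numa_maps file of a running process (needs permission)."""
    with open(f"/proc/{pid}/numa_maps") as handle:
        return handle.read()


# Hypothetical sample in the numa_maps format, used when no PID is at hand.
SAMPLE = """\
00400000 default file=/opt/sim/bin/solver mapped=120 N0=120 kernelpagesize_kB=4
7f3a00000000 default file=/usr/lib64/libm-2.17.so mapped=60 mapmax=300 N0=60 kernelpagesize_kB=4
7f3a00200000 default file=/usr/lib64/libm-2.17.so anon=2 dirty=2 N5=2 kernelpagesize_kB=4
7f3a10000000 default anon=500 dirty=500 N5=500 kernelpagesize_kB=4
"""

for path, nodes in sorted(library_pages_per_node(SAMPLE).items()):
    print(path, nodes)
```

For a real job, pass `read_numa_maps(pid)` instead of `SAMPLE`. If the library text of a process running on some distant node is listed only under another node (for example, everything under N0), that is consistent with the single-node theory, though it does not by itself show that those accesses are the bottleneck.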

The mechanism of keeping several copies of some data on a NUMA system is called "replication" on Linux. The code of the kernel, of an executable, or of its shared libraries is called text. So what you want is "text replication for shared libraries". I think text replication is easier to do for kernel code.

I was able to find some experimental patches from 2003 for doing such text replication, e.g. http://lwn.net/Articles/63512/ ([RFC][PATCH] NUMA user page replication) by Dave Hansen, IBM. This patch seems to have been rejected.

A more modern (2007) variant of this technique is replication of the pagecache: http://lwn.net/Articles/223056/ (mm: replicated pagecache) by Nick Piggin, SUSE. There is also a presentation about his method: http://ondioline.org/~paul/pagecachereplication.pdf. This would work because all files are stored in the pagecache, both executables and shared libraries. But I can't find this patch in the current kernel either.

SGI has a greater need for replication (they have more NUMA machines than a typical kernel developer), so there may be some additional patches on their systems. SGI has an application tuning manual for NUMA: http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/linux/bks/SGI_Developer/books/LX_86_AppTune/sgi_html/ch05.html which mentions the dplace utility in the section "Using the dplace Command". It has an option for text replication:

-r: Specifies that text should be replicated on the node or nodes where the application is running. In some cases, replication will improve performance by reducing the need to make offnode memory references for code. The replication option applies to all programs placed by the dplace command. See the dplace(5) man page for additional information on text replication. The replication options are a string of one or more of the following characters:

l Replicate library text

b Replicate binary (a.out) text

t Thread round-robin option
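The option description above can be turned into a launch command along these lines. This is only a sketch, assuming the replication characters are passed as a single string after `-r`, as the quoted manual text describes. The helper name, the program `./simulation`, and its argument `input.dat` are hypothetical, and the exact invocation should be checked against dplace(1) on the target system.

```python
import shlex

# Replication characters listed in the manual excerpt: l, b, t.
VALID_REPLICATION_FLAGS = {"l", "b", "t"}


def build_dplace_command(program, program_args=(), replicate="lb"):
    """Build an argv list that runs `program` under dplace with -r.

    Assumption: the syntax is `dplace -r <chars> command [args]`.
    """
    unknown = set(replicate) - VALID_REPLICATION_FLAGS
    if not replicate or unknown:
        raise ValueError(f"invalid replication string: {replicate!r}")
    return ["dplace", "-r", replicate, program, *program_args]


# Hypothetical job: replicate both library and binary text.
command = build_dplace_command("./simulation", ["input.dat"])
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command)` on a machine where dplace is installed. Replicating library text (`l`) is the part that targets the shared-library theory from the question; `b` additionally replicates the executable's own text.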

Man dplace(1): http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=man&fname=/usr/share/catman/man1/dplace.1.html

Man dplace(5): http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=man&fname=/usr/share/catman/man5/dplace.5.html
