How to pin threads to cores with predetermined memory pool objects? (80 core Nehalem architecture, 2 TB RAM)


Problem description

I've run into a minor HPC problem after running some tests on an 80-core (160 HT) Nehalem architecture with 2 TB DRAM:

A server with more than 2 sockets starts to stall a lot (delay), as each thread starts to request information about objects on the "wrong" socket, i.e. requests go from a thread that is working on some objects on one socket to pull information that actually sits in the DRAM of another socket.

The cores appear 100% utilized, even though I know that they are waiting for the remote socket to return the request.

As most of the code runs asynchronously, it is a lot easier to rewrite it so that I just pass messages from the threads on one socket to threads on the other (no locked waiting). In addition, I want to lock each thread to a memory pool, so I can update objects instead of wasting time (~30%) on the garbage collector.

Which raises the question:

How do I pin threads to cores with predetermined memory pool objects, in Python?

More background information:

Python has no problem running multicore when you put ZeroMQ in the middle and make an art out of passing messages between the memory pools managed by each ZMQworker. At ZMQ's 8M msg/second, the internal update of the objects takes longer than the pipeline takes to fill. This is all described here: http://zguide.zeromq.org/page:all#Chapter-Sockets-and-Patterns

So, with a little over-simplification, I spawn 80 ZMQworker processes and 1 ZMQrouter and load the context with a large swarm of objects (584 million objects, actually). From this "start-point" the objects need to interact to complete the computation.

Here is the idea:

  • If "object X" needs to interact with "Object Y" and "Object Y" is available in the local memory pool of the python-thread, then the interaction should be done directly.
  • If "Object Y" is NOT available in the same pool, then I want it to send a message through the ZMQrouter and let the router return a response at some later point in time. My architecture is non-blocking, so whatever goes on in that particular python thread simply continues without waiting for the ZMQrouter's response. Even for objects on the same socket but on a different core I would prefer NOT to interact directly, as I prefer clean message exchanges over having 2 threads manipulate the same memory object. A sketch of this routing rule follows this list.

For this I need to know:

  1. How to figure out which socket a given python process (thread) runs on.
  2. How to assign a memory pool on that particular socket to the python process (some malloc limit or similar, so that the sum of the memory pools does not push a pool from one socket to another).
  3. Things I haven't thought of.

But I cannot find references in the python docs on how to do this, and on google I must be searching for the wrong thing.

Update:

Regarding the question "why use ZeroMQ on an MPI architecture?", please read the thread Spread vs MPI vs zeromq?: the application I am working on is designed for a distributed deployment, even though it is tested on an architecture where MPI is more suitable.

Update 2:

Regarding the question "How to pin threads to cores with predetermined memory pools in Python (3)", the answer lies in psutil:

>>> import psutil
>>> psutil.cpu_count()
4
>>> p = psutil.Process()
>>> p.cpu_affinity()  # get
[0, 1, 2, 3]
>>> p.cpu_affinity([0])  # set; from now on, this process will run on CPU #0 only
>>> p.cpu_affinity()
[0]
>>>
>>> # reset affinity against all CPUs
>>> all_cpus = list(range(psutil.cpu_count()))
>>> p.cpu_affinity(all_cpus)
>>>

The worker can be pinned to a core, whereby NUMA may be exploited effectively (look up your CPU type to verify that it is a NUMA architecture!).
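
A minimal sketch of how that pinning could look when spawning the workers, assuming one multiprocessing process per logical core; the worker body itself is a placeholder, not the actual ZMQworker code.

import multiprocessing
import psutil

def worker(core_id):
    # Pin this worker process to a single core before it starts touching its memory pool.
    psutil.Process().cpu_affinity([core_id])
    # ... run the ZMQworker loop for the pool owned by this core (placeholder) ...

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=worker, args=(core,))
             for core in range(psutil.cpu_count())]
    for p in procs:
        p.start()
    for p in procs:
        p.join()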

The second element is to determine the memory pool. That can be done with psutil as well, or with the resource library:
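
For instance, a hedged sketch using the standard resource module to cap each worker's address space so that one pool cannot grow beyond its share of a socket's DRAM; the 2 TB / 80 workers split is purely illustrative.

import resource

TOTAL_DRAM = 2 * 1024**4              # 2 TB, illustrative
N_WORKERS = 80
pool_limit = TOTAL_DRAM // N_WORKERS  # rough per-worker budget

# Cap this process's virtual address space; allocations beyond it raise MemoryError.
resource.setrlimit(resource.RLIMIT_AS, (pool_limit, pool_limit))

Note that this only bounds how much a pool may allocate; it does not decide which socket's DRAM backs the allocations. Binding memory to a socket is a NUMA question, which the answer below addresses via numactl.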

Recommended answer

You might be underestimating the issue; there is no super-easy way to accomplish what you want. As a general guideline, you need to work at the operating-system level to get things set up the way you want. You want to work with so-called "CPU affinity" and "memory affinity", and you need to think hard about your system architecture as well as your software architecture to get things right. In real HPC, the named "affinities" are normally handled by an MPI library, such as Open MPI. You might want to consider using one and letting your different processes be handled by that MPI library. The interface between the operating system, the MPI library and Python can be provided by the mpi4py package.
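
If you go that route, the entry point usually looks roughly like the following mpi4py sketch; the mapping of ranks to cores and sockets is then done by the MPI launcher's binding options (e.g. mpirun --bind-to core with recent Open MPI), not by the Python code itself.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()    # one rank per pinned process
size = comm.Get_size()

# Exchange objects between ranks by message passing instead of sharing memory across sockets.
data = comm.sendrecv({"owner": rank}, dest=(rank + 1) % size, source=(rank - 1) % size)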

You also need to get your concept of threads and processes, and the OS settings, straight. While for the CPU time scheduler a thread is the task to be scheduled and therefore could theoretically have an individual affinity, I am only aware of affinity masks for entire processes, i.e. for all threads within one process. For controlling memory access, NUMA (non-uniform memory access) is the right keyword, and you might want to look into http://linuxmanpages.com/man8/numactl.8.php
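
Before reaching for numactl, it can help to see the NUMA layout the kernel reports. A hedged sketch reading the standard Linux sysfs entries from Python (the paths are the usual Linux locations; the loop is mine, not part of any library):

import glob
import os

# Each /sys/devices/system/node/nodeN directory is one NUMA node (typically one per socket).
for node_dir in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    node = os.path.basename(node_dir)
    with open(os.path.join(node_dir, "cpulist")) as f:
        cpus = f.read().strip()
    print("%s: CPUs %s" % (node, cpus))

The command numactl --hardware prints essentially the same topology.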

In any case, you need to read articles about the affinity topic, and you might want to start with the Open MPI FAQs about CPU/memory affinity: http://www.open-mpi.de/faq/?category=tuning#paffinity-defs

In case you want to achieve your goal without using an MPI library, look into the packages util-linux or schedutils and numactl of your Linux distribution in order to get useful command-line tools such as taskset, which you could, for example, call from within Python in order to set affinity masks for certain process IDs.
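
A small sketch of what calling taskset from Python could look like, for an already running worker PID; the PID and core list below are placeholders.

import subprocess

def pin_pid_to_cores(pid, cores):
    # Restrict an existing process to the given cores via the taskset command-line tool.
    cpu_list = ",".join(str(c) for c in cores)
    subprocess.check_call(["taskset", "-pc", cpu_list, str(pid)])

# e.g. pin the worker with PID 12345 to the first ten cores (placeholder values):
# pin_pid_to_cores(12345, range(10))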

This article seems to vividly describe how an MPI library can be helpful with your issue: http://blogs.cisco.com/performance/open-mpi-v1-5-processor-affinity-options/

This SO answer describes how you can bisect your hardware architecture: https://stackoverflow.com/a/11761943/145400

Generally, I am wondering whether the machine you are using is the right one for the task, or whether you are perhaps optimizing at the wrong end. If you are messaging within one machine and hitting memory-bandwidth limits, I am not sure that ZMQ (through TCP/IP, right?) is the right tool at all to perform the messaging. Coming back to MPI, the message passing interface for HPC applications...

