OpenMP and NUMA relation?


Problem description

I have a dual-socket Xeon E5522 2.26 GHz machine (with Hyper-Threading disabled) running Ubuntu Server on Linux kernel 3.0 with NUMA support. The architecture layout is 4 physical cores per socket. An OpenMP application runs on this machine, and I have the following questions:


  1. Does an OpenMP program automatically take advantage of NUMA (i.e. are a thread and its private data kept on one NUMA node throughout execution) when running on a NUMA machine with a NUMA-aware kernel? If not, what can be done?

  2. What about NUMA and per-thread private C++ STL data structures?


Answer

The current OpenMP standard defines a boolean environment variable OMP_PROC_BIND that controls the binding of OpenMP threads. If set to true, e.g.

shell$ OMP_PROC_BIND=true OMP_NUM_THREADS=12 ./app.x

then the OpenMP execution environment should not move threads between processors. Unfortunately, nothing more is said about how those threads should be bound, and that is what a special working group in the OpenMP language committee is addressing right now. OpenMP 4.0 will come with new environment variables and clauses that allow one to specify how to distribute the threads. Of course, many OpenMP implementations also offer their own non-standard methods to control binding.

Still, most OpenMP runtimes are not NUMA-aware. They will happily dispatch threads to any available CPU, and you would have to make sure that each thread only accesses data that belongs to it. There are some general hints in this direction:


  • Do not use dynamic scheduling for parallel for (C/C++) / DO (Fortran) loops.
  • Try to initialise the data in the same thread that will later use it. If you run two separate parallel loops with the same team size and the same number of iteration chunks, then with static scheduling chunk 0 of both loops will be executed by thread 0, chunk 1 by thread 1, and so on.
  • If using OpenMP tasks, try to initialise the data in the task body, because most OpenMP runtimes implement task stealing - idle threads can steal tasks from other threads' task queues.
  • Use a NUMA-aware memory allocator like tcmalloc.

Some colleagues of mine have thoroughly evaluated the NUMA behaviour of different OpenMP runtimes and have specifically looked into the NUMA awareness of Intel's implementation, but the articles are not published yet, so I cannot provide you with a link.

There is a research project called ForestGOMP that aims at providing a NUMA-aware drop-in replacement for libgomp. Maybe you should give it a look.

