Hybrid MPI/OpenMP in LSF

Problem description

I am moving a program parallelized with OpenMP to a cluster. The cluster uses Lava 1.0 as its scheduler and has 8 cores per node. I used an MPI wrapper in the job script to do multi-host parallelism.

Here is the job script:

#BSUB -q queue_name
#BSUB -x

#BSUB -R "span[ptile=1]"
#BSUB -n 1

#BSUB -J n1p1o8
##BSUB -o outfile.email
#BSUB -e err

export OMP_NUM_THREADS=8

date
/home/apps/bin/lava.openmpi.wrapper -bynode -x OMP_NUM_THREADS \
    ~/my_program ~/input.dat ~/output.out 
date

I did some experiments on ONE host exclusively. However, I don't know how to explain some of the results.

1.

-n    OMP_NUM_THREADS    time
1     4                  21:12
2     4                  20:12

Does this mean MPI doesn't do any parallel work here? I thought that in the second case each MPI process would have 4 OpenMP threads, so it should show 800% CPU usage and be faster than the first case.

Another result that supports this:

-n    OMP_NUM_THREADS    time
2     2                  31:42
4     2                  30:47

They also have pretty close run times.

2.

In this case, if I want to parallelize this program on this cluster at a reasonably optimized speed in a simple way, is it reasonable to put 1 MPI process on every host (telling LSF that I use one core), set OMP_NUM_THREADS=8, and then run it exclusively? MPI would then only handle cross-node work, while OpenMP handles the work within each node. (-n = number of hosts; ptile = 1; OMP_NUM_THREADS = number of cores in each host)
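
For illustration, a minimal sketch of what such a hybrid submission could look like, reusing the wrapper from the script above and assuming 4 hosts (the host count and job name here are arbitrary examples):

#BSUB -q queue_name
#BSUB -x
#BSUB -R "span[ptile=1]"
#BSUB -n 4

#BSUB -J n4p1o8
#BSUB -e err

export OMP_NUM_THREADS=8

date
/home/apps/bin/lava.openmpi.wrapper -bynode -x OMP_NUM_THREADS \
    ~/my_program ~/input.dat ~/output.out
date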

UPDATE: The program is compiled with gfortran -fopenmp, without mpicc. MPI is only used to distribute the executable.
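
For reference, such a build could look like the following (the source file name and optimization level are assumptions for the example):

gfortran -fopenmp -O2 my_program.f90 -o my_program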

UPDATE Mar 3: Program memory usage monitoring

Local environment: Mac OS X 10.8 / 2.9 GHz i7 / 8 GB RAM

Without OpenMP

  • Real memory size: 8.4 MB
  • Virtual memory size: 2.37 GB
  • Shared memory size: 212 KB
  • Private memory size: 7.8 MB
  • Virtual private memory: 63.2 MB

With OpenMP (4 threads)

  • Real memory size: 31.5 MB
  • Virtual memory size: 2.52 GB
  • Shared memory size: 212 KB
  • Private memory size: 27.1 MB
  • Virtual private memory: 210.2 MB

Brief cluster hardware information

Each host contains two quad-core chips, i.e. 8 cores and 8 GB of memory per node. The hosts in the cluster are connected by InfiniBand.

Answer

Taking into account the information that you have specified in the comments, your best option is to:

  • request exclusive node access with -x (you already do that);
  • request one slot per node with -n 1 (you already do that);
  • set OMP_NUM_THREADS to the number of cores per node (you already do that);
  • enable binding of OpenMP threads;
  • launch the executable directly.

Your job script should look like this:

#BSUB -q queue_name
#BSUB -x
#BSUB -n 1

#BSUB -J n1p1o8
##BSUB -o outfile.email
#BSUB -e err

export OMP_NUM_THREADS=8
export OMP_PROC_BIND=true

date
~/my_program ~/input.dat ~/output.out
date
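
The script would then be submitted in the usual way, e.g. (the script file name here is just an example):

bsub < n1p1o8.lsf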

OMP_PROC_BIND is part of the OpenMP 3.1 specification. If you are using a compiler that adheres to an older version of the standard, you should use the vendor-specific setting instead, e.g. GOMP_CPU_AFFINITY for GCC or KMP_AFFINITY for the Intel compilers. Binding threads to cores prevents the operating system from moving threads between processor cores, which speeds up execution, especially on NUMA systems (e.g. machines with more than one CPU socket and a separate memory controller in each socket) where data locality is very important.
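
As a sketch, the vendor-specific equivalents for an 8-core node could look like the following (the exact CPU numbering and affinity policy are machine-dependent, so treat the values as illustrations only):

# GCC (libgomp) without OMP_PROC_BIND support:
export GOMP_CPU_AFFINITY="0-7"

# Intel compilers:
export KMP_AFFINITY="granularity=fine,compact"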

If you'd like to run many copies of your program over different input files, then submit array jobs. With LSF (and I guess with Lava too) this is done by changing the job script:

#BSUB -q queue_name
#BSUB -x
#BSUB -n 1

#BSUB -J n1p1o8[1-20]
##BSUB -o outfile.email
#BSUB -e err_%I

export OMP_NUM_THREADS=8
export OMP_PROC_BIND=true

date
~/my_program ~/input_${LSB_JOBINDEX}.dat ~/output_${LSB_JOBINDEX}.out
date

This submits an array job of 20 sub-jobs (-J n1p1o8[1-20]). The %I in the -e option is replaced by the job index, so you will get a separate err file for each sub-job. The LSB_JOBINDEX environment variable is set to the current job index, i.e. it is 1 in the first job, 2 in the second, and so on.
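
So, for example, the sub-job with index 3 would effectively execute the following (assuming your input files follow the naming scheme above):

export OMP_NUM_THREADS=8
export OMP_PROC_BIND=true
~/my_program ~/input_3.dat ~/output_3.out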

My question about the memory usage of your program was not about how much memory it consumes. It was about how large the typical dataset processed in a single OpenMP loop is. If the dataset is not small enough to fit into the last-level cache of the CPU(s), then memory bandwidth comes into consideration. If your code does heavy local processing on each data item, then it might scale with the number of threads. If, on the other hand, it does simple and light processing, then the memory bus might be saturated even by a single thread, especially if the code is properly vectorised. Usually this is measured by the so-called operational intensity in FLOPS/byte. It gives the amount of data processing that happens before the next data element is fetched from memory. High operational intensity means that a lot of number crunching happens in the CPU and data is only seldom transferred to/from memory. Such programs scale almost linearly with the number of threads, no matter what the memory bandwidth is. On the other hand, codes with very low operational intensity are memory-bound and leave the CPU underutilised.
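
For instance, a simple update loop such as y(i) = a*x(i) + y(i) over double-precision arrays performs 2 floating-point operations while moving roughly 24 bytes (load x(i), load y(i), store y(i)), i.e. an operational intensity of about 0.08 FLOPS/byte, which is firmly on the memory-bound side; a dense matrix-matrix multiplication, in contrast, reuses each loaded element many times and therefore scales much better with the thread count.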

A program that is heavily memory-bound doesn't scale with the number of threads but with the available memory bandwidth. For example, on a newer Intel or AMD system each CPU socket has its own memory controller and memory data path. On such systems the aggregate memory bandwidth is a multiple of the bandwidth of a single socket, e.g. a system with two sockets delivers twice the memory bandwidth of a single-socket system. In this case you might see an improvement in the run time whenever both sockets are used, e.g. if you set OMP_NUM_THREADS equal to the total number of cores, or if you set OMP_NUM_THREADS to 2 and tell the runtime to put the two threads on different sockets (this is a plausible scenario when the threads execute vectorised code and a single thread is able to saturate the local memory bus).
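
A sketch of how the latter placement could be requested, assuming cores 0-3 belong to the first socket and cores 4-7 to the second (the numbering is machine-dependent):

# OpenMP 4.0 runtimes:
export OMP_NUM_THREADS=2
export OMP_PLACES=sockets
export OMP_PROC_BIND=spread

# Older GCC/libgomp: pin one thread to each socket explicitly
export GOMP_CPU_AFFINITY="0 4"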
