SLURM: Embarrassingly parallel program inside an embarrassingly parallel program

Problem description

I have a complex model written in Matlab. The model was not written by us and is best thought of as a "black box", i.e. fixing the relevant problems from the inside would require rewriting the entire model, which would take years.

If I have an "embarrassingly parallel" problem I can use an array to submit X variations of the same simulation with the option #SBATCH --array=1-X. However, clusters normally have a (frustratingly small) limit on the maximum array size.

Whilst using a PBS/TORQUE cluster I have got around this problem by forcing Matlab to run on a single thread, requesting multiple CPUs and then running multiple instances of Matlab in the background. An example submission script is:

#!/bin/bash
<OTHER PBS COMMANDS>
#PBS -l nodes=1:ppn=5,walltime=30:00:00
#PBS -t 1-600

<GATHER DYNAMIC ARGUMENTS FOR MATLAB FUNCTION CALLS BASED ON ARRAY NUMBER>

# define Matlab options
options="-nodesktop -noFigureWindows -nosplash -singleCompThread"

for sub_job in {1..5}
do
    <GATHER DYNAMIC ARGUMENTS FOR MATLAB FUNCTION CALLS BASED ON LOOP NUMBER (i.e. sub_job)>
    matlab ${options} -r "run_model(${arg1}, ${arg2}, ..., ${argN}); exit" &
done
wait
<TIDY UP AND FINISH COMMANDS>

Can anyone help me do the equivalent on a SLURM cluster?

  • Matlab's parfor construct will not run my model in a parallel loop.
  • The PBS/TORQUE language was very intuitive but SLURM's is confusing me. Assuming a submission script structured like my PBS example, here is what I think certain options would do.
    • --cpus-per-task=5 seems like the most obvious one to me. Would I put srun in front of the matlab command in the loop, or leave it as it is in the PBS script loop?
    • --ntasks=5 I would imagine would request 5 CPUs but would run in serial unless a program specifically requests them (e.g. MPI or multithreaded Python). Would I need to put srun in front of the Matlab command in this case?

Solution

While Tom's suggestion to use GNU Parallel is a good one, I will attempt to answer the question asked.

If you want to run 5 instances of the matlab command with the same arguments (for example, if they were communicating via MPI), then you would want to request --cpus-per-task=1 and --ntasks=5, preface your matlab line with srun, and get rid of the loop.

In your case, as each of your 5 calls to matlab is independent, you want to request --cpus-per-task=5 and --ntasks=1. This ensures that each job is allocated 5 CPU cores to do with as you wish. You can preface your matlab line with srun if you wish, but it will make little difference since you are only running one task.
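
As a rough sketch only, the PBS script from the question might translate to something like the following under the single-task, five-core request described above. The run_index helper and the single run_id argument passed to run_model are hypothetical stand-ins for the argument gathering the question leaves elided, and the echo fallback is an added convenience so the script can be dry-run on a machine without MATLAB; none of these come from the original posts.

```shell
#!/bin/bash
# One task (this batch script) with five cores, repeated as a 600-element array.
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=5
#SBATCH --time=30:00:00
#SBATCH --array=1-600

# Same Matlab options as in the PBS script.
options="-nodesktop -noFigureWindows -nosplash -singleCompThread"

# Slurm sets these inside the job; the defaults only matter for a dry run.
n_sub="${SLURM_CPUS_PER_TASK:-5}"
array_id="${SLURM_ARRAY_TASK_ID:-1}"

# Assumption: if matlab is not on PATH, print each command instead of running it.
if command -v matlab >/dev/null 2>&1; then launcher=""; else launcher="echo"; fi

# Hypothetical mapping from (array element, number of sub-jobs, sub-job)
# to one global run number.
run_index() {
    echo $(( ($1 - 1) * $2 + $3 ))
}

for sub_job in $(seq 1 "$n_sub"); do
    run_id=$(run_index "$array_id" "$n_sub" "$sub_job")
    # Background each single-threaded Matlab instance, as in the PBS version.
    $launcher matlab $options -r "run_model(${run_id}); exit" &
done

# Block until all background instances have finished.
wait
```

With this mapping, array element 3 would launch runs 11 to 15, so 600 array elements cover 3000 runs in total.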

Of course, this is only efficient if each of your 5 matlab runs takes about the same amount of time, since if one takes much longer, the other 4 CPU cores will sit idle, waiting for the fifth to finish.
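
One possible way to soften the idle-core problem, which is not part of the answer above, is to give each job more runs than cores and let a small worker pool pull the next run as soon as a core frees up. The sketch below uses xargs -P for that (GNU Parallel, mentioned earlier, fills a similar role). The run_queue function, the count of 20 runs, and the single numeric argument to run_model are all hypothetical, and the echo fallback is again only there for dry runs without MATLAB.

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=5
#SBATCH --time=30:00:00

options="-nodesktop -noFigureWindows -nosplash -singleCompThread"

# Number of concurrent workers; the default only matters outside a Slurm job.
n_cpus="${SLURM_CPUS_PER_TASK:-5}"

# Assumption: if matlab is not on PATH, print each command instead of running it.
if command -v matlab >/dev/null 2>&1; then launcher=""; else launcher="echo"; fi

# run_queue N: hypothetical helper that feeds run numbers 1..N to at most
# n_cpus concurrent Matlab instances. xargs starts the next run as soon as
# any worker finishes, so one slow run does not hold up the others.
run_queue() {
    seq 1 "$1" | xargs -P "$n_cpus" -I{} $launcher matlab $options -r "run_model({}); exit"
}

run_queue 20
```

The trade-off is that each job runs longer, so the walltime request needs to cover the whole queue rather than a single batch of five.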
