SLURM sbatch job array for the same script but with different input arguments run in parallel


Question

I have a problem where I need to launch the same script but with different input arguments.

Say I have a script myscript.py -p <par_Val> -i <num_trial>, where I need to consider N different par_values (between x0 and x1) and M trials for each value of par_values.

Each of the M trials almost reaches the time limit of the cluster I am working on (and I don't have privileges to change it). So in practice I need to run NxM independent jobs.

Because each batch job has the same node/CPU configuration and invokes the same Python script, changing only the input parameters, in principle (in pseudo-language) I should have an sbatch script that does something like:

#!/bin/bash
#SBATCH --job-name=cv_01
#SBATCH --output=cv_analysis_eis-%j.out
#SBATCH --error=cv_analysis_eis-%j.err
#SBATCH --partition=gpu2
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4

for p1 in 0.05 0.075 0.1 0.25 0.5
do
    for i in {0..150..5}
    do
        python myscript.py -p $p1 -v $i
    done
done

where every call of the script is itself a batch job. Looking at the sbatch documentation, the -a/--array option seems promising. But in my case I need to change the input parameters for every one of the NxM jobs. How can I do this? I would rather not write NxM batch scripts and list them in a txt file as suggested by this post, and the solution proposed here does not seem ideal either, since this is, imho, exactly the case for a job array. Moreover, I want to make sure that all NxM jobs are launched at the same time and that the invoking script above terminates right after, so that it does not hit the time limit itself and get my whole job killed by the system and left incomplete (whereas each of the NxM jobs fits within that limit, so nothing is lost if they run in parallel but independently).

Answer

The best approach is to use job arrays.

One option is to pass the parameter p1 when submitting the job script, so you will only have one script, but will have to submit it multiple times, once for each p1 value.

The code would look like this (untested):

#!/bin/bash
#SBATCH --job-name=cv_01
#SBATCH --output=cv_analysis_eis-%j-%a.out
#SBATCH --error=cv_analysis_eis-%j-%a.err
#SBATCH --partition=gpu2
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
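# 31 array tasks, one per trial value: 0, 5, 10, ..., 150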
#SBATCH -a 0-150:5

python myscript.py -p $1 -v $SLURM_ARRAY_TASK_ID

and you would submit it as:

sbatch my_jobscript.sh 0.05
sbatch my_jobscript.sh 0.075
...
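
If you don't want to type those submissions by hand, a small wrapper loop (just a convenience sketch, using the p1 values from the question) does the same thing:

#!/bin/bash
# Submit one array job per p1 value; each submission expands into
# 31 array tasks (trial values 0, 5, ..., 150) via the -a 0-150:5 directive.
for p1 in 0.05 0.075 0.1 0.25 0.5
do
    sbatch my_jobscript.sh $p1
done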

Another approach is to define all the p1 parameters in a bash array and submit NxM jobs (untested):

#!/bin/bash
#SBATCH --job-name=cv_01
#SBATCH --output=cv_analysis_eis-%j-%a.out
#SBATCH --error=cv_analysis_eis-%j-%a.err
#SBATCH --partition=gpu2
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
# Make the array NxM: 5 values of p1 x 31 trials = 155 tasks
#SBATCH -a 0-154

PARRAY=(0.05 0.075 0.1 0.25 0.5)    

# p1 is the element of PARRAY selected with ARRAY_TASK_ID mod the array length
p1=${PARRAY[$((SLURM_ARRAY_TASK_ID % ${#PARRAY[@]}))]}
# v is the integer division of ARRAY_TASK_ID by the array length
v=$((SLURM_ARRAY_TASK_ID / ${#PARRAY[@]}))
python myscript.py -p $p1 -v $v
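
If you want to sanity-check the index arithmetic before submitting, a quick local dry run (not part of the answer above; it only mimics SLURM_ARRAY_TASK_ID) prints the (p1, v) pair each task would receive:

#!/bin/bash
# Dry run, no SLURM needed: reproduce the mapping from array task ID to (p1, v).
PARRAY=(0.05 0.075 0.1 0.25 0.5)
for SLURM_ARRAY_TASK_ID in $(seq 0 154)
do
    p1=${PARRAY[$((SLURM_ARRAY_TASK_ID % ${#PARRAY[@]}))]}
    v=$((SLURM_ARRAY_TASK_ID / ${#PARRAY[@]}))
    echo "task $SLURM_ARRAY_TASK_ID -> -p $p1 -v $v"
done

Note that with this scheme -v receives the trial index 0..30 rather than the stepped values 0, 5, ..., 150 from the question's loop; if myscript.py expects the stepped value, multiply v by 5 before passing it.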
