Running slurm script with multiple nodes, launch job steps with 1 task


Problem description

I am trying to launch a large number of job steps using a batch script. The different steps can be completely different programs and do need exactly one CPU each. First I tried doing this using the --multi-prog argument to srun. Unfortunately, when using all CPUs assigned to my job in this manner, performance degrades massively. The run time increases to almost its serialized value. By undersubscribing I could ameliorate this a little. I couldn't find anything online regarding this problem, so I assumed it to be a configuration problem of the cluster I am using.
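
For illustration, a minimal sketch of the --multi-prog variant described above (multi.conf, prog_a and prog_b are placeholder names, not taken from my actual setup):

# multi.conf: one line per task rank or rank range; %t expands to the task number
0      ./prog_a
1-47   ./prog_b input_%t.dat

launched from within the batch allocation with a single srun call:

srun -n $SLURM_NTASKS --multi-prog multi.conf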

So I tried going a different route. I implemented the following script (launched via sbatch my_script.slurm):

#!/bin/bash
#SBATCH -o $HOME/slurm/slurm_out/%j.%N.out
#SBATCH --error=$HOME/slurm/slurm_out/%j.%N.err_out
#SBATCH --get-user-env
#SBATCH -J test
#SBATCH -D $HOME/slurm
#SBATCH --export=NONE
#SBATCH --ntasks=48

NR_PROCS=$(($SLURM_NTASKS))
for PROC in $(seq 0 $(($NR_PROCS-1)));
do
    #My call looks like this:
    #srun --exclusive -n1 bash $PROJECT/call_shells/call_"$PROC".sh &
    srun --exclusive -n1 hostname &
    pids[${PROC}]=$!    #Save PID of this background process
done
for pid in ${pids[*]};
do
    wait ${pid}    # Wait on each PID; wait returns that process's exit status
done

I am aware that the --exclusive argument is not really needed in my case. The shell scripts that are called contain the different binaries and their arguments. The remaining part of my script relies on the fact that all processes have finished, hence the wait. I changed the calling line to make it a minimal working example.
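
For illustration, one of the called scripts could look like the following; the binary and its arguments are made-up placeholders:

#!/bin/bash
# Hypothetical content of $PROJECT/call_shells/call_0.sh:
# run one serial program with its own input and output files
./my_program --input input_0.dat --output output_0.dat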

At first this seemed to be the solution. Unfortunately when increasing the number of nodes used in my job allocation (for example by increasing --ntasks to a number larger than the number of CPUs per node in my cluster), the script does not work as expected anymore, returning

srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1

and continuing on only one node (i.e. 48 CPUs in my case, which work through the job steps as fast as before); all processes on the other node(s) are subsequently killed.

This seems to be the expected behaviour, but I can't really understand it. Why does every job step in a given allocation need to include a minimum number of tasks equal to the number of nodes included in the allocation? I ordinarily do not care at all about the number of nodes used in my allocation.

How can I implement my batch script so that it can be used reliably on multiple nodes?

Recommended answer

Found it! The nomenclature and the many command-line options to Slurm confused me. The solution is given by

#!/bin/bash
#SBATCH -o $HOME/slurm/slurm_out/%j.%N.out
#SBATCH --error=$HOME/slurm/slurm_out/%j.%N.err_out
#SBATCH --get-user-env
#SBATCH -J test
#SBATCH -D $HOME/slurm
#SBATCH --export=NONE
#SBATCH --ntasks=48

NR_PROCS=$(($SLURM_NTASKS))
for PROC in $(seq 0 $(($NR_PROCS-1)));
do
    #My call looks like this:
    #srun --exclusive -N1 -n1 bash $PROJECT/call_shells/call_"$PROC".sh &
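    # -N1 -n1 runs this step as a single task on a single node;
    # --exclusive asks Slurm to give the step CPUs not shared with other running steps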
    srun --exclusive -N1 -n1 hostname &
    pids[${PROC}]=$!    #Save PID of this background process
done
for pid in ${pids[*]};
do
    wait ${pid}    # Wait on each PID; wait returns that process's exit status
done

This tells srun to run each job step on exactly one node with a single task only.
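
As a quick sanity check (assuming job accounting is enabled on the cluster), sacct can confirm that each step ran with one task on one node; <jobid> below is a placeholder for the id printed by sbatch:

sacct -j <jobid> --format=JobID,JobName,NNodes,NCPUS,Elapsed,State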
