How to run jobs in parallel using one Slurm batch script?

Problem description

I am trying to run multiple python scripts in parallel with one Slurm batch script. Take a look at the example below:

#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=/dev/null
#SBATCH --error=/dev/null
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --partition=All
#SBATCH --time=5:00

srun sleep 60
srun sleep 60
wait

How do I tweak the script such that the execution will take only 60 sec (instead of 120) ? Splitting the script into two scripts is not an option.

Answer

As written, that script is running two sleep commands in parallel, two times in a row.

Each srun command initiates a step, and since you set --ntasks=2, each step instantiates two tasks (here, two copies of the sleep command).

If you want to run two 1-task steps in parallel, you should write it this way:

srun --exclusive -n 1 -c 1 sleep 60 &
srun --exclusive -n 1 -c 1 sleep 60 &
wait

Then each step only instantiates one task, and is backgrounded by the & delimiter, meaning the next srun can start immediately. The wait command makes sure the script terminates only when both steps are finished.
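The effect of backgrounding with & and collecting the jobs with wait can be checked with plain shell commands, no Slurm required. This is a minimal sketch using bare sleep in place of the srun steps:

```shell
# Two backgrounded sleeps run concurrently, so total wall time
# is ~2 seconds rather than 4.
start=$(date +%s)
sleep 2 &
sleep 2 &
wait                                  # block until both background jobs finish
elapsed=$(( $(date +%s) - start ))
echo "elapsed: ${elapsed}s"
```

The same timing argument is why the two srun steps above finish in about 60 seconds instead of 120.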

In that context, the xargs command and GNU parallel can be useful to avoid writing multiple identical srun lines or a for-loop.

For instance, if you have multiple files you need to run your script over:

find /path/to/data/*.csv -print0 | xargs -0 -n1 -P $SLURM_NTASKS srun -n1 --exclusive python my_python_script.py

This is equivalent to writing many lines like:

srun -n 1 -c 1 --exclusive python my_python_script.py /path/to/data/file1.csv &
srun -n 1 -c 1 --exclusive python my_python_script.py /path/to/data/file2.csv &
srun -n 1 -c 1 --exclusive python my_python_script.py /path/to/data/file3.csv &
[...]
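The xargs flags can be sanity-checked outside Slurm by substituting a plain command for the srun invocation. A sketch, where the file names are illustrative:

```shell
# -0 consumes NUL-separated input, -n1 passes one file name per
# invocation, and -P sets how many invocations run concurrently.
# (Output order may vary when -P > 1.)
printf '%s\0' file1.csv file2.csv file3.csv \
  | xargs -0 -n1 -P 2 echo processing
```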

GNU parallel is useful to iterate over parameter values:

parallel -P $SLURM_NTASKS srun -n1 --exclusive python my_python_script.py ::: {1..1000}

will run

python my_python_script.py 1
python my_python_script.py 2
python my_python_script.py 3
...
python my_python_script.py 1000

Another approach is to simply run

srun python my_python_script.py

and, inside the Python script, to look for the SLURM_PROCID environment variable and split the work according to its value. The srun command will start multiple instances of the script and each will 'see' a different value for SLURM_PROCID.

import os
# Each srun-launched task sees a different value of SLURM_PROCID.
print(os.environ['SLURM_PROCID'])
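One way to split the work by rank is round-robin slicing over the input list. This is a hypothetical sketch (my_work_items and the range(10) workload are illustrative, not from the original answer); it falls back to a single task when run outside Slurm:

```python
import os

def my_work_items(items, rank, ntasks):
    """Round-robin split: task `rank` of `ntasks` handles every ntasks-th item."""
    return items[rank::ntasks]

# Each srun-launched instance reads its own rank from the environment.
rank = int(os.environ.get("SLURM_PROCID", "0"))
ntasks = int(os.environ.get("SLURM_NTASKS", "1"))
print(my_work_items(list(range(10)), rank, ntasks))
```

With --ntasks=2, rank 0 would process the even-indexed items and rank 1 the odd-indexed ones.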
