How to run jobs in parallel using one slurm batch script?
Question
I am trying to run multiple python scripts in parallel with one Slurm batch script. Take a look at the example below:
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=/dev/null
#SBATCH --error=/dev/null
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --partition=All
#SBATCH --time=5:00
srun sleep 60
srun sleep 60
wait
How do I tweak the script so that the execution takes only 60 seconds (instead of 120)? Splitting the script into two scripts is not an option.
Answer
As written, that script is running two sleep commands in parallel, two times in a row.
Each srun command initiates a step, and since you set --ntasks=2, each step instantiates two tasks (here the sleep commands).
If you want to run two 1-task steps in parallel, you should write it this way:
srun --exclusive -n 1 -c 1 sleep 60 &
srun --exclusive -n 1 -c 1 sleep 60 &
wait
Each step then instantiates only one task, and is backgrounded by the & delimiter, meaning the next srun can start immediately. The wait command makes sure the script terminates only when both steps are finished.
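Outside of Slurm, the same "& plus wait" mechanics can be sketched with plain sleep in place of srun; this is only an illustration of the shell pattern, not the actual job script:

```shell
# A minimal sketch of the backgrounding pattern: the two sleeps run
# concurrently, so total wall time is about one sleep, not two.
start=$(date +%s)
sleep 2 &
sleep 2 &
wait                      # return only when both background jobs have finished
end=$(date +%s)
elapsed=$((end - start))
echo "elapsed: ${elapsed}s"
```

The same reasoning explains why the corrected batch script finishes in about 60 seconds: the two steps overlap instead of running back to back.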
In that context, the xargs command and GNU parallel can be useful to avoid writing multiple identical srun lines, or to avoid a for-loop.
For instance, if you have multiple files you need to run your script over:
find /path/to/data/*.csv -print0 | xargs -0 -n1 -P $SLURM_NTASKS srun -n1 --exclusive python my_python_script.py
This is equivalent to writing many lines of:
srun -n 1 -c 1 --exclusive python my_python_script.py /path/to/data/file1.csv &
srun -n 1 -c 1 --exclusive python my_python_script.py /path/to/data/file2.csv &
srun -n 1 -c 1 --exclusive python my_python_script.py /path/to/data/file3.csv &
[...]
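The fan-out that xargs performs here can be sketched outside Slurm by substituting echo for srun (the file names below are illustrative, not from the original answer):

```shell
# Sketch: xargs -0 reads NUL-delimited names, -n1 passes one name per
# command, and -P 2 keeps up to two commands running at once. srun is
# replaced by echo so this runs without a Slurm allocation.
out=$(printf '%s\0' file1.csv file2.csv file3.csv \
  | xargs -0 -n1 -P 2 echo processing)
echo "$out"
```

In the real job script, -P $SLURM_NTASKS caps the number of concurrent srun steps at the number of tasks in the allocation.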
GNU parallel is useful to iterate over parameter values:
parallel -P $SLURM_NTASKS srun -n1 --exclusive python my_python_script.py ::: {1..1000}
which will run
python my_python_script.py 1
python my_python_script.py 2
python my_python_script.py 3
...
python my_python_script.py 1000
Another approach is to simply run
srun python my_python_script.py
and, inside the Python script, to look for the SLURM_PROCID environment variable and split the work according to its value. The srun command will start multiple instances of the script, and each will 'see' a different value for SLURM_PROCID.
import os
print(os.environ['SLURM_PROCID'])
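Building on that one-liner, a work split keyed on SLURM_PROCID might look like the sketch below. The file list and the round-robin slicing are illustrative assumptions, not part of the original answer; SLURM_NTASKS is read the same way, and the defaults let the sketch run outside Slurm as task 0 of 1:

```python
import os

# Task identity: which instance am I, out of how many?
proc_id = int(os.environ.get('SLURM_PROCID', 0))
n_tasks = int(os.environ.get('SLURM_NTASKS', 1))

# Illustrative inputs; in practice this would be the real work items.
files = [f'file{i}.csv' for i in range(10)]

# Round-robin split: each task takes every n_tasks-th item,
# starting at its own offset, so the items are disjoint across tasks.
my_files = files[proc_id::n_tasks]
print(f'task {proc_id}/{n_tasks} handles {my_files}')
```

With --ntasks=2, task 0 would get the even-indexed files and task 1 the odd-indexed ones, with no coordination needed between the instances.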