How to run jobs in parallel using one slurm batch script?
Question
I am trying to run multiple python scripts in parallel with one Slurm batch script. Take a look at the example below:
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=/dev/null
#SBATCH --error=/dev/null
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --partition=All
#SBATCH --time=5:00
srun sleep 60
srun sleep 60
wait
How do I tweak the script so that the execution takes only 60 seconds (instead of 120)? Splitting the script into two scripts is not an option.
Answer
As written, that script is running two sleep commands in parallel, two times in a row.
Each srun command initiates a step, and since you set --ntasks=2, each step instantiates two tasks (here the sleep commands).
If you want to run two 1-task steps in parallel, you should write it this way:
srun --exclusive -n 1 -c 1 sleep 60 &
srun --exclusive -n 1 -c 1 sleep 60 &
wait
Each step then instantiates only one task, and is backgrounded by the & delimiter, meaning the next srun can start immediately. The wait command makes sure the script terminates only when both steps are finished.
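Outside of Slurm, the same "& plus wait" mechanics can be sketched with plain sleep in place of srun; this is only an illustration of the shell pattern, not the actual job script:

```shell
# A minimal sketch of the backgrounding pattern: the two sleeps run
# concurrently, so total wall time is about one sleep, not two.
start=$(date +%s)
sleep 2 &
sleep 2 &
wait                      # return only when both background jobs have finished
end=$(date +%s)
elapsed=$((end - start))
echo "elapsed: ${elapsed}s"
```

The same reasoning explains why the corrected batch script finishes in about 60 seconds: the two steps overlap instead of running back to back.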
In that context, the xargs command and GNU parallel can be useful to avoid writing multiple identical srun lines, or to avoid a for-loop.
For instance, if you have multiple files you need to run your script over:
find /path/to/data/*.csv -print0 | xargs -0 -n1 -P $SLURM_NTASKS srun -n1 --exclusive python my_python_script.py
This is equivalent to writing many lines of:
srun -n 1 -c 1 --exclusive python my_python_script.py /path/to/data/file1.csv &
srun -n 1 -c 1 --exclusive python my_python_script.py /path/to/data/file2.csv &
srun -n 1 -c 1 --exclusive python my_python_script.py /path/to/data/file3.csv &
[...]
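The fan-out that xargs performs here can be sketched outside Slurm by substituting echo for srun (the file names below are illustrative, not from the original answer):

```shell
# Sketch: xargs -0 reads NUL-delimited names, -n1 passes one name per
# command, and -P 2 keeps up to two commands running at once. srun is
# replaced by echo so this runs without a Slurm allocation.
out=$(printf '%s\0' file1.csv file2.csv file3.csv \
  | xargs -0 -n1 -P 2 echo processing)
echo "$out"
```

In the real job script, -P $SLURM_NTASKS caps the number of concurrent srun steps at the number of tasks in the allocation.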
GNU parallel is useful to iterate over parameter values:
parallel -P $SLURM_NTASKS srun -n1 --exclusive python my_python_script.py ::: {1..1000}
which will run
python my_python_script.py 1
python my_python_script.py 2
python my_python_script.py 3
...
python my_python_script.py 1000
Another approach is to simply run
srun python my_python_script.py
and, inside the Python script, to look for the SLURM_PROCID environment variable and split the work according to its value. The srun command will start multiple instances of the script, and each will 'see' a different value for SLURM_PROCID.
import os
print(os.environ['SLURM_PROCID'])
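Building on that one-liner, a work split keyed on SLURM_PROCID might look like the sketch below. The file list and the round-robin slicing are illustrative assumptions, not part of the original answer; SLURM_NTASKS is read the same way, and the defaults let the sketch run outside Slurm as task 0 of 1:

```python
import os

# Task identity: which instance am I, out of how many?
proc_id = int(os.environ.get('SLURM_PROCID', 0))
n_tasks = int(os.environ.get('SLURM_NTASKS', 1))

# Illustrative inputs; in practice this would be the real work items.
files = [f'file{i}.csv' for i in range(10)]

# Round-robin split: each task takes every n_tasks-th item,
# starting at its own offset, so the items are disjoint across tasks.
my_files = files[proc_id::n_tasks]
print(f'task {proc_id}/{n_tasks} handles {my_files}')
```

With --ntasks=2, task 0 would get the even-indexed files and task 1 the odd-indexed ones, with no coordination needed between the instances.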