How to hold up a script until a Slurm job (started with srun) is completely finished?
Problem description
I am running a job array with SLURM, using the following job array script (which I submit with sbatch job_array_script.sh [args]):
#!/bin/bash
#SBATCH ... other options ...
#SBATCH --array=0-1000%200
srun ./job_slurm_script.py $1 $2 $3 $4
echo 'open' > status_file.txt
To explain: I want job_slurm_script.py to be run as an array job 1000 times, with at most 200 tasks in parallel. When all of those are done, I want to write 'open' to status_file.txt. This is because in reality I have more than 10,000 jobs, which is above my cluster's MaxSubmissionLimit, so I need to split them into smaller chunks (1000-element job arrays) and run them one after the other (each only when the previous one is finished).
However, for this to work, the echo statement can only trigger once the entire job array is finished. (Outside of this script, I have a loop that checks status_file.txt to see if the job is finished, i.e. whether its contents are the string 'open'.)
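The outer loop described above can be sketched as follows. This is an illustration, not the asker's actual code: the function name wait_for_open and the poll interval are assumptions; only status_file.txt and the sentinel string 'open' come from the question.

```shell
#!/bin/bash
# Sketch of the outer polling loop: block until the given file exists
# and contains exactly the string 'open', checking periodically.
wait_for_open() {
    local file=$1
    local interval=${2:-5}   # seconds between checks (assumed default)
    # Keep polling until the file exists and its contents equal 'open'
    until [ -f "$file" ] && [ "$(cat "$file")" = "open" ]; do
        sleep "$interval"
    done
    echo "job array finished"
}
```

A call like wait_for_open status_file.txt 10 would then gate the submission of the next 1000-element chunk.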
Up to now I thought that srun holds the script up until the whole job array is finished. However, sometimes srun "returns" and the script reaches the echo statement before the jobs are finished, so all the subsequent jobs bounce off the cluster because they exceed the submission limit.
So how do I make srun "hold up" until the whole job array is finished?
Recommended answer
You can add the flag --wait to sbatch.
Check man sbatch for information about --wait.
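With sbatch --wait, the submitting command itself blocks until every task in the array has terminated, so the outer driver can submit the 10,000 jobs in 1000-task chunks with no sentinel file at all. A minimal sketch, assuming a hypothetical driver function (the name submit_in_chunks and the chunk arithmetic are my own; --wait, --array, and the %200 throttle are real sbatch options; note that your cluster's MaxArraySize may additionally cap the allowed index range):

```shell
#!/bin/bash
# Hypothetical driver: submit jobs in fixed-size array chunks and wait
# for each chunk to finish before submitting the next. sbatch --wait
# blocks until every task in the submitted array has terminated.
submit_in_chunks() {
    local total=$1 chunk=$2
    shift 2
    local start end count=0
    for (( start=0; start<total; start+=chunk )); do
        end=$(( start + chunk - 1 ))
        (( end >= total )) && end=$(( total - 1 ))
        # %200 keeps at most 200 tasks of this chunk running in parallel
        sbatch --wait --array="${start}-${end}%200" job_array_script.sh "$@"
        count=$(( count + 1 ))
    done
    echo "submitted ${count} chunks"
}
```

Called as submit_in_chunks 10000 1000 arg1 arg2 arg3 arg4, this would submit ten 1000-task arrays one after another. If MaxArraySize on your cluster forbids indices above 1000, submit every chunk as --array=0-999 instead and pass the chunk offset to the job script as an extra argument.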