What happens if I run more subjobs than the number of cores allocated?
Question
I have an sbatch (Slurm job scheduler) script in which I process a lot of data through three scripts: foo1.sh, foo2.sh and foo3.sh.
foo1.sh and foo2.sh are independent, and I want to run them simultaneously. foo3.sh needs the outputs of foo1.sh and foo2.sh, so I am building a dependency. I then have to repeat this 30 times.
Say:
## Resources config
#SBATCH --ntasks=30
#SBATCH --task-per-core=1
for i in {1..30};
do
srun -n 1 --jobid=foo1_$i ./foo1.sh &
srun -n 1 --jobid=foo2_$i ./foo2.sh &
srun -n 1 --jobid=foo3_$i --dependency=afterok:foo1_$i:foo2_$i ./foo3.sh &
done;
wait
The idea is that foo1_1 and foo2_1 are launched, and since foo3_1 has to wait for those two jobs to finish, the loop moves on to the next iteration. That iteration launches foo1_2 and foo2_2, foo3_2 waits, and so on.
At some point, the number of subjobs launched with srun will be higher than --ntasks=30. What happens then? Will the extra subjobs wait for a previous one to finish (the behavior I am looking for)?
Thanks
Recommended answer
Slurm will run 30 srun commands, but the 31st will wait until a core is freed within your 30-core allocation.
Note that the proper argument is --ntasks-per-core=1, not --tasks-per-core=1.
You can test this yourself by using salloc rather than sbatch to work interactively:
$ salloc --ntasks=2 --ntasks-per-core=1
$ srun -n 1 sleep 10 & srun -n 1 sleep 10 & time srun -n 1 echo ok
[1] 2734
[2] 2735
ok
[1]- Done srun -n 1 sleep 10
[2]+ Done srun -n 1 sleep 10
real 0m10.201s
user 0m0.072s
sys 0m0.028s
The simple echo took 10 seconds because the third srun had to wait until the first two had finished, as the allocation has only two cores.
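Combining this queuing behaviour with the loop from the question, one possible way to express "foo3 after foo1 and foo2" is with the shell's own wait instead of named dependencies. The following is only a sketch under assumptions: LAUNCHER, run_iteration and run_all are hypothetical names introduced here for illustration, and the three foo scripts are assumed to be executable in the working directory. Setting LAUNCHER to an empty string allows a dry run outside Slurm.

```shell
#!/bin/sh
# Sketch only: LAUNCHER is a hypothetical knob, not a Slurm option.
# Inside an allocation it defaults to "srun -n 1" (one task per step);
# set LAUNCHER="" to dry-run the same logic locally without Slurm.
LAUNCHER="${LAUNCHER-srun -n 1}"

# run_iteration FIRST SECOND THIRD
# Starts FIRST and SECOND concurrently, waits for both, then runs THIRD
# only if both exited with status 0 (the "afterok" intent of the question).
run_iteration() {
    $LAUNCHER "$1" & pid1=$!
    $LAUNCHER "$2" & pid2=$!
    rc1=0; wait "$pid1" || rc1=$?
    rc2=0; wait "$pid2" || rc2=$?
    if [ "$rc1" -eq 0 ] && [ "$rc2" -eq 0 ]; then
        $LAUNCHER "$3"
    else
        echo "skipping $3: a prerequisite failed" >&2
        return 1
    fi
}

# run_all COUNT
# Starts COUNT iterations in the background, then waits for all of them.
# Each backgrounded iteration runs in its own subshell, so its pid/rc
# variables do not interfere with the other iterations.
run_all() {
    n=1
    while [ "$n" -le "$1" ]; do
        run_iteration ./foo1.sh ./foo2.sh ./foo3.sh &
        n=$((n + 1))
    done
    wait
}
```

In the batch script, after the #SBATCH lines, calling run_all 30 would start all 30 iterations at once. Based on the behaviour described above, the srun steps beyond the 30 allocated tasks would be expected to queue until a core is freed, and each foo3 step would start only after its own foo1 and foo2 steps have succeeded.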