What happens if I run more subjobs than the number of cores allocated?

Question

I have an sbatch (Slurm job scheduler) script in which I process a lot of data through three scripts: foo1.sh, foo2.sh and foo3.sh.

foo1.sh and foo2.sh are independent and I want to run them simultaneously. foo3.sh needs the outputs of foo1.sh and foo2.sh so I am building a dependency. And then I have to repeat it 30 times.

Say:

## Resources config
#SBATCH --ntasks=30
#SBATCH --task-per-core=1

for i in {1..30};
do
    srun -n 1 --jobid=foo1_$i ./foo1.sh &
    srun -n 1 --jobid=foo2_$i ./foo2.sh &
    srun -n 1 --jobid=foo3_$i --dependency=afterok:foo1_$i:foo2_$i ./foo3.sh &
done;
wait

The idea is that foo1_1 and foo2_1 are launched, but since foo3_1 has to wait for those two jobs to finish, I want to move on to the next iteration. The next iteration launches foo1_2 and foo2_2, foo3_2 waits for them, and so on.

At some point, then, the number of subjobs launched with srun will be higher than --ntasks=30. What is going to happen? Will it wait for a previous job to finish (behavior I am looking for)?

Thanks

Recommended answer

Slurm will run 30 srun's, but the 31st will wait until a core is freed within your 30-core allocation. Note that the proper argument is --ntasks-per-core=1, not --task-per-core=1 as written in your script.

You can test it by yourself using salloc rather than sbatch to work interactively:

$ salloc --ntasks=2 --ntasks-per-core=1
$ srun -n 1 sleep 10 & srun -n 1 sleep 10 & time srun -n 1 echo ok
[1] 2734
[2] 2735
ok
[1]-  Done                    srun -n 1 sleep 10
[2]+  Done                    srun -n 1 sleep 10

real    0m10.201s
user    0m0.072s
sys     0m0.028s

You can see that the simple echo took 10 seconds: the third srun had to wait until the first two had finished, because the allocation has only two cores.
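For the ordering the question asks for (foo3.sh only after both foo1.sh and foo2.sh), one possible approach is to let the shell do the waiting inside each iteration instead of relying on --dependency between srun calls. The following is only a sketch under assumptions: run_iteration and LAUNCHER are hypothetical names introduced for illustration, the three scripts are assumed to sit in the working directory, and the check on SLURM_JOB_ID is there only so the loop runs inside an allocation and nowhere else.

```shell
#!/bin/bash
#SBATCH --ntasks=30
#SBATCH --ntasks-per-core=1

# Assumption: every step is launched as "srun -n 1 <script>", as in the question.
LAUNCHER=(srun -n 1)

# Hypothetical helper (not part of Slurm): runs one foo1/foo2/foo3 iteration.
run_iteration() {
    local i=$1 pid1 pid2 rc=0

    # foo1.sh and foo2.sh are independent, so start both in the background.
    "${LAUNCHER[@]}" ./foo1.sh &
    pid1=$!
    "${LAUNCHER[@]}" ./foo2.sh &
    pid2=$!

    # Wait for both and remember whether either one failed.
    wait "$pid1" || rc=1
    wait "$pid2" || rc=1

    if [ "$rc" -ne 0 ]; then
        echo "iteration $i: foo1.sh or foo2.sh failed, skipping foo3.sh" >&2
        return 1
    fi

    # Both succeeded, which is the intent of "afterok" in the question.
    "${LAUNCHER[@]}" ./foo3.sh
}

# Only launch the 30 iterations when running inside a Slurm allocation.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    for i in {1..30}; do
        run_iteration "$i" &
    done
    wait
fi
```

Each iteration runs in its own background subshell, so all 30 are started right away; as described in the answer, any srun that does not find a free core in the allocation should simply wait for one.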
