SLURM `srun` vs `sbatch` and their parameters
Problem description
I am trying to understand what the difference is between SLURM's `srun` and `sbatch` commands. I will be happy with a general explanation, rather than specific answers to the following questions, but here are some specific points of confusion that can be a starting point and give an idea of what I'm looking for.
According to the documentation, `srun` is for submitting jobs, and `sbatch` is for submitting jobs for later execution, but the practical difference is unclear to me, and their behavior seems to be the same. For example, I have a cluster with 2 nodes, each with 2 CPUs. If I execute `srun testjob.sh &` five times in a row, it will nicely queue up the fifth job until a CPU becomes available, as will executing `sbatch testjob.sh`.
To make the question more concrete, I think a good place to start might be: What are some things that I can do with one that I cannot do with the other, and why?
Many of the arguments to both commands are the same. The ones that seem the most relevant are `--ntasks`, `--nodes`, `--cpus-per-task`, `--ntasks-per-node`. How are these related to each other, and how do they differ for `srun` vs `sbatch`?
One particular difference is that `srun` will cause an error if `testjob.sh` does not have executable permission (i.e. `chmod +x testjob.sh` has not been run), whereas `sbatch` will happily run it. What is happening "under the hood" that causes this to be the case?
The documentation also mentions that `srun` is commonly used inside of `sbatch` scripts. This leads to the question: How do they interact with each other, and what is the "canonical" use case for each of them? Specifically, would I ever use `srun` by itself?
The documentation says "srun is used to submit a job for execution in real time" while "sbatch is used to submit a job script for later execution."
They both accept practically the same set of parameters. The main difference is that `srun` is interactive and blocking (you get the result in your terminal and you cannot write other commands until it is finished), while `sbatch` is batch processing and non-blocking (results are written to a file and you can submit other commands right away).
If you use `srun` in the background with the `&` sign, then you remove the 'blocking' feature of `srun`, which becomes interactive but non-blocking. It is still interactive though, meaning that the output will clutter your terminal, and the `srun` processes are linked to your terminal. If you disconnect, you will lose control over them, or they might be killed (depending on whether they use `stdout` or not, basically). They will also be killed if the machine you connect to in order to submit jobs is rebooted.
If you use `sbatch`, you submit your job and it is handled by Slurm; you can disconnect, kill your terminal, etc. with no consequence. Your job is no longer linked to a running process.
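The contrast can be tried from a login node with a sketch like the one below. It assumes a working Slurm installation; `testjob.sh` here is a hypothetical stand-in script, and `slurm-<jobid>.out` is the default output file name used by `sbatch`. The Slurm calls are guarded so the snippet only writes the script on a machine without Slurm.

```shell
# Hypothetical job script used only for illustration.
cat > testjob.sh <<'EOF'
#!/bin/bash
echo "running on $(hostname)"
EOF
chmod +x testjob.sh   # srun executes the file directly, so it must be executable

if command -v srun >/dev/null 2>&1 && command -v sbatch >/dev/null 2>&1; then
    # Blocking: the prompt returns only after the job has run,
    # and the output appears in this terminal.
    srun ./testjob.sh

    # Non-blocking: prints "Submitted batch job <jobid>" and returns at once;
    # output goes to slurm-<jobid>.out unless --output says otherwise.
    sbatch testjob.sh

    # The batch job can be checked from this or any later session.
    squeue -u "$USER"
else
    echo "Slurm commands not found; nothing submitted."
fi
```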
What are some things that I can do with one that I cannot do with the other, and why?
A feature that is available to `sbatch` and not to `srun` is job arrays. As `srun` can be used within an `sbatch` script, there is nothing that you cannot do with `sbatch`.
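As a rough illustration of a job array, the sketch below writes a batch script that requests five array tasks. The file name `array_job.sh` and the program `process_chunk` are hypothetical; `--array`, `SLURM_ARRAY_TASK_ID`, and the `%A`/`%a` file name patterns are standard `sbatch` features.

```shell
cat > array_job.sh <<'EOF'
#!/bin/bash
# Five array tasks with indices 0..4.
#SBATCH --array=0-4
#SBATCH --ntasks=1
# %A is the array's master job ID and %a is the array index.
#SBATCH --output=array_%A_%a.out

# Slurm sets SLURM_ARRAY_TASK_ID to the index of each array task.
echo "array task ${SLURM_ARRAY_TASK_ID} starting"

# process_chunk is a hypothetical program that takes the index as its argument.
srun ./process_chunk "${SLURM_ARRAY_TASK_ID}"
EOF

# Submit only where Slurm is installed.
if command -v sbatch >/dev/null 2>&1; then
    sbatch array_job.sh
else
    echo "sbatch not found; wrote array_job.sh only."
fi
```

One submission then produces five independent jobs, which is the part that has no `srun` equivalent.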
How are these related to each other, and how do they differ for srun vs sbatch?
All the parameters `--ntasks`, `--nodes`, `--cpus-per-task`, `--ntasks-per-node` have the same meaning in both commands. That is true for nearly all parameters, with the notable exception of `--exclusive`.
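To see how these four options fit together, here is a minimal sketch sized for the two-node, two-CPU-per-node cluster from the question. The file name `layout_job.sh` is hypothetical, and the arithmetic at the end only restates the request: tasks = nodes × tasks per node, CPUs = tasks × CPUs per task.

```shell
cat > layout_job.sh <<'EOF'
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1

# With this request, srun should start one copy of hostname per task.
srun hostname
EOF

# The same request expressed as arithmetic.
nodes=2
ntasks_per_node=2
cpus_per_task=1
ntasks=$((nodes * ntasks_per_node))
total_cpus=$((ntasks * cpus_per_task))
echo "tasks: $ntasks, CPUs requested: $total_cpus"
```

Passing `--ntasks=4` instead of `--ntasks-per-node=2` would request the same number of tasks but leave their placement across nodes to Slurm.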
What is happening "under the hood" that causes this to be the case?
`srun` immediately executes the script on the remote host, while `sbatch` copies the script into internal storage and then uploads it to the compute node when the job starts. You can check this by modifying your submission script after it has been submitted; changes will not be taken into account (see this).
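The permission difference can be reproduced without Slurm as a plain-shell analogy. This is not Slurm's actual code path, only the same distinction in miniature: executing a file directly needs the execute bit, whereas reading its contents (as `sbatch` does when it copies the script) only needs read permission.

```shell
workdir=$(mktemp -d)

cat > "$workdir/testjob.sh" <<'EOF'
#!/bin/bash
echo "hello from testjob"
EOF
chmod -x "$workdir/testjob.sh"   # make sure the execute bit is off

# Direct execution: the kernel refuses to exec a file without the execute bit.
if "$workdir/testjob.sh" 2>/dev/null; then
    direct_status="ran"
else
    direct_status="failed"
fi

# Reading the file and handing it to an interpreter needs read permission only.
interp_output=$(bash "$workdir/testjob.sh")

echo "direct: $direct_status, via bash: $interp_output"
```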
How do they interact with each other, and what is the "canonical" use-case for each of them?
You typically use `sbatch` to submit a job and `srun` in the submission script to create job steps, as Slurm calls them. `srun` is used to launch the processes. If your program is a parallel MPI program, `srun` takes care of creating all the MPI processes. If not, `srun` will run your program as many times as specified by the `--ntasks` option. There are many use cases depending on whether your program is parallel or not, has a long running time or not, is composed of a single executable or not, etc. Unless otherwise specified, `srun` inherits by default the pertinent options of the `sbatch` or `salloc` which it runs under (from here).
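A submission script with two job steps might look like the sketch below. The file name `steps_job.sh` and the MPI binary `./my_mpi_app` are hypothetical; the ten-minute time limit is an arbitrary choice and may need adjusting for a given cluster.

```shell
cat > steps_job.sh <<'EOF'
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --time=00:10:00

# Job step 0: inherits --ntasks=4 from the allocation, so hostname runs 4 times.
srun hostname

# Job step 1: overrides the inherited value and runs a single task.
srun --ntasks=1 echo "one task only"

# For an MPI program, a single srun would start all the ranks
# (./my_mpi_app is a hypothetical binary, so the line stays commented out):
# srun ./my_mpi_app
EOF

# Submit only where Slurm is installed.
if command -v sbatch >/dev/null 2>&1; then
    sbatch steps_job.sh
else
    echo "sbatch not found; wrote steps_job.sh only."
fi
```

Each `srun` line becomes its own job step inside the single allocation that `sbatch` obtained.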
Specifically, would I ever use srun by itself?
Other than for small tests, no. A common use is `srun --pty bash` to get a shell on a compute node.