What does the --ntasks or -n option do in SLURM?

Question

I was using SLURM on a computing cluster and came across the --ntasks (or -n) option. I have read the documentation for it (http://slurm.schedmd.com/sbatch.html):

sbatch does not launch tasks, it requests an allocation of resources and submits a batch script. This option advises the Slurm controller that job steps run within the allocation will launch a maximum of number tasks and to provide for sufficient resources. The default is one task per node, but note that the --cpus-per-task option will change this default.

The specific part I do not understand is:

run within the allocation will launch a maximum of number tasks and to provide for sufficient resources.

I have a few questions:

  1. I guess my first question is what the word "task" means and how it differs from the word "job" in the SLURM context. I usually think of a job as running the bash script under sbatch, as in sbatch my_batch_job.sh. I am not sure what a task means.
  2. If I equate the word task with job, then I would have expected it to run the same bash script multiple times, according to the argument to -n, --ntasks=<number>. However, I tested it on the cluster: I ran an echo hello script with --ntasks=9 and expected sbatch to echo hello 9 times to stdout (which is collected in slurm-job_id.out). To my surprise, there was a single execution of my echo hello script. So what does this option even do? It seems to do nothing, or at least I can't see what it is supposed to be doing.


I do know the -a, --array=<indexes> option exists for multiple jobs. That is a different topic. I simply want to know what --ntasks is supposed to do, ideally with an example so that I can test it out on the cluster.

Recommended answer

The --ntasks parameter is useful if you have commands that you want to run in parallel within the same batch script. This may be two separate commands separated by an & or two commands used in a bash pipe (|).

For example, using the default ntasks=1:

#!/bin/bash

#SBATCH --ntasks=1

srun sleep 10 & 
srun sleep 12 &
wait

This will throw a warning:

Job step creation temporarily disabled, retrying

By default the number of tasks is one, so the second srun cannot start until the first has finished. This job will finish in around 22 seconds. To break this down:

sacct -j515058 --format=JobID,Start,End,Elapsed,NCPUS

        JobID               Start                 End    Elapsed      NCPUS
------------ ------------------- ------------------- ---------- ----------
515058       2018-12-13T20:51:44 2018-12-13T20:52:06   00:00:22          1
515058.batch 2018-12-13T20:51:44 2018-12-13T20:52:06   00:00:22          1
515058.0     2018-12-13T20:51:44 2018-12-13T20:51:56   00:00:12          1
515058.1     2018-12-13T20:51:56 2018-12-13T20:52:06   00:00:10          1

Here step 0 (515058.0) started and finished (in 12 seconds), followed by step 1 (515058.1, in 10 seconds), for a total elapsed time of 22 seconds.

To run both of these commands simultaneously:

#!/bin/bash

#SBATCH --ntasks=2

srun --ntasks=1 sleep 10 & 
srun --ntasks=1 sleep 12 &
wait

Running the same sacct command as above, with the new job ID:

sacct -j 515064 --format=JobID,Start,End,Elapsed,NCPUS

        JobID               Start                 End    Elapsed      NCPUS
------------ ------------------- ------------------- ---------- ----------
515064       2018-12-13T21:34:08 2018-12-13T21:34:20   00:00:12          2
515064.batch 2018-12-13T21:34:08 2018-12-13T21:34:20   00:00:12          2
515064.0     2018-12-13T21:34:08 2018-12-13T21:34:20   00:00:12          1
515064.1     2018-12-13T21:34:08 2018-12-13T21:34:18   00:00:10          1

Here the whole job takes 12 seconds. Because the number of tasks was specified in the batch script, the job has the resources to run both commands at once, so neither command has to wait for resources.
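The two totals follow from simple arithmetic on the step durations: run back to back they add up, and run side by side the longest step sets the total. A minimal sketch of that reasoning, using the 12 s and 10 s Elapsed values from the sacct output; the variable names are illustrative and nothing here calls Slurm:

```shell
#!/bin/bash
# Step durations in seconds, copied from the Elapsed column of the sacct output.
step_durations=(12 10)

serial_total=0     # steps run one after another (ntasks=1)
parallel_total=0   # steps overlap (ntasks=2): the longest step sets the total

for d in "${step_durations[@]}"; do
  serial_total=$(( serial_total + d ))
  if (( d > parallel_total )); then
    parallel_total=$d
  fi
done

echo "ntasks=1, steps back to back: ${serial_total}s"
echo "ntasks=2, steps overlapping:  ${parallel_total}s"
```

The results match the Elapsed values reported for jobs 515058 and 515064 respectively.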

Each srun command inherits the parameters specified for the batch script. This is why --ntasks=1 needs to be given to each srun: otherwise each srun uses --ntasks=2 and occupies both task slots, so the second command cannot run until the first has finished.
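This inheritance also explains the echo hello test from the question: the batch script body runs once, while an srun with no --ntasks of its own launches one copy of its command per task. A minimal sketch, assuming a hypothetical file name hello_tasks.sh; SLURM_PROCID is the per-task rank that Slurm sets in each task's environment. The script is written to disk and submitted only if sbatch is available:

```shell
#!/bin/bash
# Write a small batch script (hello_tasks.sh is a hypothetical file name).
cat > hello_tasks.sh <<'EOF'
#!/bin/bash
#SBATCH --ntasks=9

# The batch script itself runs only once.
echo "hello from the batch script"

# This srun has no --ntasks of its own, so it inherits 9 from the job
# and launches one copy of the command per task.
srun bash -c 'echo "hello from task $SLURM_PROCID"'
EOF

# Submit only where Slurm is installed; otherwise just leave the file in place.
if command -v sbatch >/dev/null 2>&1; then
  sbatch hello_tasks.sh
else
  echo "sbatch not found; hello_tasks.sh written but not submitted"
fi
```

On a cluster, the job's slurm-<jobid>.out should then contain one line from the batch script and nine lines from the tasks, in no guaranteed order.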

Another caveat of this inheritance arises if --export=NONE is specified as a batch parameter. In that case, --export=ALL should be specified for each srun command; otherwise, environment variables set within the sbatch script are not inherited by the srun command.

Additional notes:
When using bash pipes, it may be necessary to specify --nodes=1 to prevent the commands on either side of the pipe from running on separate nodes.
When using & to run commands simultaneously, the wait is vital. Without it, the batch script would reach its end while the background srun commands were still running, and the job would end, cancelling them before they finished.
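The role of wait can be tried out without Slurm. A minimal plain-bash sketch, assuming subshells with sleep as stand-ins for the two srun commands; the temporary file and the "step A" / "step B" labels are illustrative only:

```shell
#!/bin/bash
# Plain-bash stand-in: each subshell plays the role of one backgrounded 'srun ...'.
out=$(mktemp)

( sleep 1; echo "step A done" >> "$out" ) &
( sleep 2; echo "step B done" >> "$out" ) &

# Block until every background command has exited. Without this line the
# script would carry on (and, in a batch job, reach its end) immediately.
wait

result=$(cat "$out")
rm -f "$out"
echo "$result"
```

Removing the wait line makes the script read the file before either background command has written to it, which mirrors a batch script ending before its job steps complete.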
