Wait for all jobs of a user to finish before submitting subsequent jobs to a PBS cluster


Problem description

I am trying to adjust some bash scripts to make them run on a (pbs) cluster.

The individual tasks are performed by several scripts that are started by a main script. So far this main script starts multiple scripts in the background (by appending &), making them run in parallel on one multi-core machine. I want to substitute these calls with qsubs to distribute the load across the cluster nodes.

However, some jobs depend on others to be finished before they can start. So far, this was achieved by wait statements in the main script. But what is the best way to do this using the grid engine?
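
For illustration, the current main script looks roughly like this (the script names are placeholders, not my real ones):

 # current single-machine pattern: run a batch in the background, then wait
 for i in $(seq 1 1000); do
     ./stage1_task.sh "$i" &
 done
 wait                        # block until every first-stage task has finished
 for i in $(seq 1 1000); do
     ./stage2_task.sh "$i" &
 done
 wait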

I already found this question as well as the -W after:jobid[:jobid...] documentation in the qsub man page, but I hope there is a better way. We are talking about several thousand jobs to run in parallel first, and another set of the same size to run simultaneously after the last one of these has finished. This would mean I had to queue a lot of jobs, each depending on a lot of jobs.
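
A direct translation would look something like this sketch (script names are made up); note how every second-stage job carries the full list of first-stage job ids:

 deps=""
 for i in $(seq 1 1000); do
     deps="$deps:$(qsub ./stage1_task.sh)"          # collect all first-stage job ids
 done
 for i in $(seq 1 1000); do
     qsub -W depend=afterok$deps ./stage2_task.sh   # huge dependency list per job
 done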

I could bring this down by using a dummy job in between, doing nothing but depending on the first group of jobs, on which the second group could then depend. This would decrease the number of dependencies from millions to thousands, but still: it feels wrong, and I am not even sure whether such a long command line would be accepted by the shell.
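
As a sketch, this dummy-job variant would only change the second half of the snippet above (again with made-up names):

 # the dummy job carries the one long dependency list ...
 dummy=$(qsub -W depend=afterok$deps ./do_nothing.sh)
 # ... and each second-stage job depends on the dummy job only
 for i in $(seq 1 1000); do
     qsub -W depend=afterok:"$dummy" ./stage2_task.sh
 done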


  • Isn't there a way to wait for all my jobs to finish (something like qwait -u <user>)?
  • Or all jobs that were submitted from this script (something like qwait [-p <PID>])?

Of course it would be possible to write something like this using qstat and sleep in a while loop, but I guess this use case is important enough to have a built-in solution, and I was just unable to figure it out.
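
Something like this is what I mean (a rough sketch; the exact qstat output format differs between PBS flavours, so the grep would probably need adjusting):

 # poor man's qwait: poll until none of my jobs are listed any more
 while qstat -u "$USER" 2>/dev/null | grep -q "$USER"; do
     sleep 60
 done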

What would you recommend / use in such a situation?

Addendum I:

Since it was requested in a comment:

$ qsub --version
version: 2.4.8

Maybe this is also helpful to determine the exact PBS system:

$ qsub --help
usage: qsub [-a date_time] [-A account_string] [-b secs]
      [-c [ none | { enabled | periodic | shutdown |
      depth=<int> | dir=<path> | interval=<minutes>}... ]
      [-C directive_prefix] [-d path] [-D path]
      [-e path] [-h] [-I] [-j oe] [-k {oe}] [-l resource_list] [-m n|{abe}]
      [-M user_list] [-N jobname] [-o path] [-p priority] [-P proxy_user] [-q queue]
      [-r y|n] [-S path] [-t number_to_submit] [-T type] [-u user_list] [-w] path
      [-W otherattributes=value...] [-v variable_list] [-V] [-x] [-X] [-z] [script]

Since the comments so far point to job arrays, I searched the qsub man page with the following results:

[...]
DESCRIPTION
[...]
       In addition to the above, the following environment variables will be available to the batch job.
[...]
       PBS_ARRAYID
              each member of a job array is assigned a unique identifier (see -t)
[...]
OPTIONS
[...]
       -t array_request
               Specifies the task ids of a job array. Single task arrays are allowed.
               The array_request argument is an integer id or a range of integers. Multiple ids or id ranges can be combined in a comman delimeted list. Examples : -t 1-100 or -t 1,10,50-100
[...]

Addendum II:

I have tried the torque solution given by Dmitri Chubarov but it does not work as described.

Without the job array it works as expected:

testuser@headnode ~ $ qsub -W depend=afterok:`qsub ./test1.sh` ./test2 && qstat
2553.testserver.domain
Job id                  Name             User            Time Use S Queue
----------------------- ---------------- --------------- -------- - -----
2552.testserver         Test1            testuser               0 Q testqueue
2553.testserver         Test2            testuser               0 H testqueue
testuser@headnode ~ $ qstat
Job id                  Name             User            Time Use S Queue
----------------------- ---------------- --------------- -------- - -----
2552.testserver         Test1            testuser               0 R testqueue
2553.testserver         Test2            testuser               0 H testqueue
testuser@headnode ~ $ qstat
Job id                  Name             User            Time Use S Queue
----------------------- ---------------- --------------- -------- - -----
2553.testserver         Test2            testuser               0 R testqueue

However, using job arrays the second job won't start:

testuser@headnode ~ $ qsub -W depend=afterok:`qsub -t 1-2 ./test1.sh` ./test2 && qstat
2555.testserver.domain
Job id                  Name             User            Time Use S Queue
----------------------- ---------------- --------------- -------- - -----
2554-1.testserver       Test1-1          testuser               0 Q testqueue
2554-2.testserver       Test1-1          testuser               0 Q testqueue
2555.testserver         Test2            testuser               0 H testqueue
testuser@headnode ~ $ qstat
Job id                  Name             User            Time Use S Queue
----------------------- ---------------- --------------- -------- - -----
2554-1.testserver       Test1-1          testuser               0 R testqueue
2554-2.testserver       Test1-2          testuser               0 R testqueue
2555.testserver         Test2            testuser               0 H testqueue
testuser@headnode ~ $ qstat
Job id                  Name             User            Time Use S Queue
----------------------- ---------------- --------------- -------- - -----
2555.testserver         Test2            testuser               0 H testqueue

I guess this is due to the lack of array indication in the job id that is returned by the first qsub:

testuser@headnode ~ $ qsub -t 1-2 ./test1.sh
2556.testserver.domain

As you can see, there is no ...[] indicating that this is a job array. Also, in the qstat output there are no ...[]s, but ...-1 and ...-2 indicating the array.

So the remaining question is how to format -W depend=afterok:... to make a job depend on a specified job array.

Recommended answer

Filling in the details, following the solution suggested by Jonathan in the comments.

There are several resource managers based on the original Portable Batch System: OpenPBS, TORQUE and PBS Professional. The systems have diverged significantly and use different command syntax for newer features such as job arrays.

Job arrays are a convenient way to submit multiple similar jobs based on the same job script. Quoting from the manual:


Sometimes users will want to submit large numbers of jobs based on the same job script. Rather than using a script to repeatedly call qsub, a feature known as job arrays now exists to allow the creation of multiple jobs with one qsub command.

To submit a job array, PBS provides the following syntax:

 qsub -t 0-10,13,15 script.sh

This submits jobs with ids 0, 1, 2, ..., 10, 13, 15.

Within the script the variable PBS_ARRAYID carries the id of the job within the array and can be used to pick the necessary configuration.
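
For example, a job script might use it like this (a minimal sketch; the input file naming is made up):

 #!/bin/bash
 #PBS -N array_example
 # PBS_ARRAYID selects the per-task input for this member of the array
 ./process_data.sh "input_${PBS_ARRAYID}.dat"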

Job arrays have their own specific dependency options.

TORQUE is the resource manager that is probably used in the OP's setup. It provides additional dependency options, as can be seen in the following example:

$ qsub -t 1-1000 script.sh
1234[].pbsserver.domainname
$ qsub -t 1001-2000 -W depend=afterokarray:1234[] script.sh
1235[].pbsserver.domainname

This will result in the following qstat output:

1234[]         script.sh    user          0 R queue
1235[]         script.sh    user          0 H queue   

Tested on TORQUE version 3.0.4.

The full afterokarray syntax is in the qsub(1) manual.

In PBS Professional dependencies can work uniformly on ordinary jobs and array jobs. Here is an example:

$ qsub -J 1-1000 -ry script.sh
1234[].pbsserver.domainname
$ qsub -J 1001-2000 -ry -W depend=afterok:1234[] script.sh
1235[].pbsserver.domainname

This will result in the following qstat output:

1234[]         script.sh    user          0 B queue
1235[]         script.sh    user          0 H queue   



Update on Torque versions

Array dependencies have been available in TORQUE since version 2.5.3. Job arrays from version 2.5 are not compatible with job arrays in versions 2.3 or 2.4. In particular, the [] syntax was introduced in TORQUE in version 2.5.

For TORQUE versions prior to 2.5, a different solution may work that is based on submitting dummy delimiter jobs between the batches of jobs to be separated. It uses three dependency types: on, before, and after.

Consider the following example:

 $ DELIM=`qsub -Wdepend=on:1000 dummy.sh `
 $ qsub -Wdepend=beforeany:$DELIM script.sh
 1001.pbsserver.domainname
 ... another 998 jobs ...
 $ qsub -Wdepend=beforeany:$DELIM script.sh
 2000.pbsserver.domainname
 $ qsub -Wdepend=after:$DELIM script.sh
 2001.pbsserver.domainname
 ...

This will result in a queue state like this:

1000         dummy.sh    user          0 H queue
1001         script.sh   user          0 R queue   
...
2000         script.sh   user          0 R queue   
2001         script.sh   user          0 H queue
...   

That is, job #2001 will run only after the previous 1000 jobs terminate. The rudimentary job array facilities available in TORQUE 2.4 could probably be used as well to submit the script job.
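
A wrapper around this pattern for two batches of N jobs might look like the following sketch (script names are placeholders):

 N=1000
 # delimiter job: "on:N" makes it wait until N jobs have registered a
 # "before" dependency on it
 DELIM=$(qsub -W depend=on:$N ./dummy.sh)
 # first batch: each job tells the delimiter "run me before you"
 for i in $(seq 1 "$N"); do
     qsub -W depend=beforeany:"$DELIM" ./stage1_task.sh
 done
 # second batch: starts only after the delimiter, i.e. after the whole first batch
 for i in $(seq 1 "$N"); do
     qsub -W depend=after:"$DELIM" ./stage2_task.sh
 done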

This solution will also work for TORQUE version 2.5 and higher.
