How fast can one submit consecutive and independent jobs with qsub?


Question

This question is related to pbs job no output when busy, i.e. some of the jobs I submit produce no output when PBS/Torque is 'busy'. I imagine it is busier when many jobs are submitted one after another, and, as it happens, of the jobs submitted in this fashion I often get some that do not produce any output.

Here is some code.

Suppose I have a Python script called "x_analyse.py" that takes as its input a file containing some data and analyses the data stored in that file:

./x_analyse.py data_1.pkl
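
For concreteness, "x_analyse.py" might look something like the following minimal sketch; the actual analysis and the result-file naming here are assumptions, since the real script is not shown.

#!/usr/bin/env python
# Hypothetical sketch of x_analyse.py: load a pickled data set,
# run some analysis, and write the result to a companion file.
import sys
import pickle

def main():
    datafile = sys.argv[1]                       # e.g. data_1.pkl
    with open(datafile, 'rb') as f:
        data = pickle.load(f)
    result = sum(data)                           # placeholder for the real analysis
    resultfile = datafile.replace('.pkl', '.result')
    with open(resultfile, 'w') as f:
        f.write('%s\n' % result)

if __name__ == '__main__':
    main()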

Now, suppose I need to:
(1) Prepare N such data files: data_1.pkl, data_2.pkl, ..., data_N.pkl.
(2) Have "x_analyse.py" work on each of them and write the results to a file for each.
(3) Since the analyses of the different data files are all independent of each other, I am going to use PBS/Torque to run them in parallel to save time. (I think this is essentially an 'embarrassingly parallel' problem.)

I have this Python script to do the above:

import os
import sys
import time

N = 100

for k in range(1, N+1):
    datafilename = 'data_%d' % k
    file = open(datafilename + '.pkl', 'wb')
    #Prepare data set k, and save it in the file
    file.close()

    jobname = 'analysis_%d' % k
    file = open(jobname + '.sub', 'w')
    file.writelines( [ '#!/bin/bash\n',
                       '#PBS -N %s\n' % jobname,
                       '#PBS -o %s\n' % (jobname + '.out'),
                       '#PBS -q compute\n' ,
                       '#PBS -j oe\n' ,
                       '#PBS -l nodes=1:ppn=1\n' ,
                       '#PBS -l walltime=5:00:00\n' ,
                       'cd $PBS_O_WORKDIR\n' ,
                       '\n' ,
                       './x_analyse.py %s\n' % (datafilename + '.pkl') ] ) 
    file.close()

    os.system('qsub %s' % (jobname + '.sub')) 

    time.sleep(2.)

The script prepares a set of data to be analysed, saves it to a file, writes a PBS submit file for analysing this set of data, submits the job, and then moves on to doing the same with the next set of data, and so on.

As it is, when the script is run, a list of job IDs is printed to the standard output as the jobs are submitted. 'ls' shows that there are N .sub files and N .pkl data files. 'qstat' shows that all the jobs are running with status 'R' and are then completed with status 'C'. However, afterwards 'ls' shows that there are fewer than N .out output files and fewer than N result files produced by "x_analyse.py". In effect, no output is produced by some of the jobs. If I clear everything and re-run the above script, I get the same behaviour, with some jobs (but not necessarily the same ones as last time) producing no output.

It has been suggested that things improve if I increase the waiting time between the submission of consecutive jobs:

time.sleep(10.) #or some other waiting time

But I feel this is not the most satisfactory solution, because I have tried 0.1 s, 0.5 s, 1.0 s, 2.0 s and 3.0 s, none of which really helped. I have been told that a 50 s waiting time seems to work fine, but if I have to submit 100 jobs, the waiting time will be about 5000 s, which seems awfully long.

I have tried reducing the number of times 'qsub' is used by submitting a job array instead. I would prepare all the data files as before, but only have one submit file, "analyse_all.sub":

#!/bin/bash
#PBS -N analyse
#PBS -o analyse.out
#PBS -q compute
#PBS -j oe
#PBS -l nodes=1:ppn=1
#PBS -l walltime=5:00:00
cd $PBS_O_WORKDIR

./x_analyse.py data_$PBS_ARRAYID.pkl

and then submit it with

qsub -t 1-100 analyse_all.sub

But even with this, some jobs still do not produce output.

Is this a common problem? Am I doing something wrong? Is waiting between job submissions the best solution? Is there something I can do to improve this?

Thanks in advance for any help.

I'm using Torque version 2.4.7 and Maui version 3.3.

Also, suppose the job with job ID 1184430.mgt1 produces no output while the job with job ID 1184431.mgt1 produces output as expected; when I use 'tracejob' on these jobs I get the following:

[batman@gotham tmp]$tracejob 1184430.mgt1
/var/spool/torque/server_priv/accounting/20121213: Permission denied
/var/spool/torque/mom_logs/20121213: No such file or directory
/var/spool/torque/sched_logs/20121213: No such file or directory

Job: 1184430.mgt1

12/13/2012 13:53:13  S    enqueuing into compute, state 1 hop 1
12/13/2012 13:53:13  S    Job Queued at request of batman@mgt1, owner = batman@mgt1, job name = analysis_1, queue = compute
12/13/2012 13:53:13  S    Job Run at request of root@mgt1
12/13/2012 13:53:13  S    Not sending email: User does not want mail of this type.
12/13/2012 13:54:48  S    Not sending email: User does not want mail of this type.
12/13/2012 13:54:48  S    Exit_status=135 resources_used.cput=00:00:00  resources_used.mem=15596kb resources_used.vmem=150200kb resources_used.walltime=00:01:35
12/13/2012 13:54:53  S    Post job file processing error
12/13/2012 13:54:53  S    Email 'o' to batman@mgt1 failed: Child process '/usr/lib/sendmail -f adm batman@mgt1' returned 67 (errno 10:No child processes)
[batman@gotham tmp]$tracejob 1184431.mgt1
/var/spool/torque/server_priv/accounting/20121213: Permission denied
/var/spool/torque/mom_logs/20121213: No such file or directory
/var/spool/torque/sched_logs/20121213: No such file or directory

Job: 1184431.mgt1

12/13/2012 13:53:13  S    enqueuing into compute, state 1 hop 1
12/13/2012 13:53:13  S    Job Queued at request of batman@mgt1, owner = batman@mgt1, job name = analysis_2, queue = compute
12/13/2012 13:53:13  S    Job Run at request of root@mgt1
12/13/2012 13:53:13  S    Not sending email: User does not want mail of this type.
12/13/2012 13:53:31  S    Not sending email: User does not want mail of this type.
12/13/2012 13:53:31  S    Exit_status=0 resources_used.cput=00:00:16 resources_used.mem=19804kb resources_used.vmem=154364kb resources_used.walltime=00:00:18

Edit 2: For a job that produces no output, 'qstat -f' returns the following:

[batman@gotham tmp]$qstat -f 1184673.mgt1
Job Id: 1184673.mgt1   
Job_Name = analysis_7
Job_Owner = batman@mgt1
resources_used.cput = 00:00:16
resources_used.mem = 17572kb
resources_used.vmem = 152020kb
resources_used.walltime = 00:01:36
job_state = C
queue = compute
server = mgt1
Checkpoint = u
ctime = Fri Dec 14 14:00:31 2012
Error_Path = mgt1:/gpfs1/batman/tmp/analysis_7.e1184673
exec_host = node26/0
Hold_Types = n
Join_Path = oe
Keep_Files = n
Mail_Points = a
mtime = Fri Dec 14 14:02:07 2012
Output_Path = mgt1.gotham.cis.XXXX.edu:/gpfs1/batman/tmp/analysis_7.out
Priority = 0
qtime = Fri Dec 14 14:00:31 2012
Rerunable = True
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=1
Resource_List.walltime = 05:00:00
session_id = 9397
Variable_List = PBS_O_HOME=/gpfs1/batman,PBS_O_LANG=en_US.UTF-8, PBS_O_LOGNAME=batman,
    PBS_O_PATH=/gpfs1/batman/bin:/usr/mpi/gcc/openmpi-1.4/bin:/gpfs1/batman/workhere/instal
    ls/mygnuplot-4.4.4/bin/:/gpfs2/condor-7.4.4/bin:/gpfs2/condor-7.4.4/sb
    in:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/bin:/usr/local/bin:/bi
    n:/usr/bin:/opt/moab/bin:/opt/moab/sbin:/opt/xcat/bin:/opt/xcat/sbin,
    PBS_O_MAIL=/var/spool/mail/batman,PBS_O_SHELL=/bin/bash,
    PBS_SERVER=mgt1,PBS_O_WORKDIR=/gpfs1/batman/tmp,
    PBS_O_QUEUE=compute,PBS_O_HOST=mgt1
sched_hint = Post job file processing error; job 1184673.mgt1 on host node
    26/0Unknown resource type  REJHOST=node26 MSG=invalid home directory '
    /gpfs1/batman' specified, errno=116 (Stale NFS file handle)
etime = Fri Dec 14 14:00:31 2012
exit_status = 135  
submit_args = analysis_7.sub
start_time = Fri Dec 14 14:00:31 2012
Walltime.Remaining = 1790
start_count = 1
fault_tolerant = False
comp_time = Fri Dec 14 14:02:07 2012

as compared with a job that produces output:

[batman@gotham tmp]$qstat -f 1184687.mgt1
Job Id: 1184687.mgt1
Job_Name = analysis_1
Job_Owner = batman@mgt1
resources_used.cput = 00:00:16
resources_used.mem = 19652kb
resources_used.vmem = 162356kb
resources_used.walltime = 00:02:38
job_state = C
queue = compute
server = mgt1
Checkpoint = u
ctime = Fri Dec 14 14:40:46 2012
Error_Path = mgt1:/gpfs1/batman/tmp/analysis_1.e118468
    7
exec_host = ionode2/0
Hold_Types = n
Join_Path = oe
Keep_Files = n
Mail_Points = a
mtime = Fri Dec 14 14:43:24 2012
Output_Path = mgt1.gotham.cis.XXXX.edu:/gpfs1/batman/tmp/analysis_1.out
Priority = 0
qtime = Fri Dec 14 14:40:46 2012
Rerunable = True   
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=1
Resource_List.walltime = 05:00:00
session_id = 28039 
Variable_List = PBS_O_HOME=/gpfs1/batman,PBS_O_LANG=en_US.UTF-8,
    PBS_O_LOGNAME=batman,
    PBS_O_PATH=/gpfs1/batman/bin:/usr/mpi/gcc/openmpi-1.4/bin:/gpfs1/batman/workhere/instal
    ls/mygnuplot-4.4.4/bin/:/gpfs2/condor-7.4.4/bin:/gpfs2/condor-7.4.4/sb
    in:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/bin:/usr/local/bin:/bi
    n:/usr/bin:/opt/moab/bin:/opt/moab/sbin:/opt/xcat/bin:/opt/xcat/sbin,
    PBS_O_MAIL=/var/spool/mail/batman,PBS_O_SHELL=/bin/bash,
    PBS_SERVER=mgt1,PBS_O_WORKDIR=/gpfs1/batman/tmp,
    PBS_O_QUEUE=compute,PBS_O_HOST=mgt1
etime = Fri Dec 14 14:40:46 2012
exit_status = 0
submit_args = analysis_1.sub
start_time = Fri Dec 14 14:40:47 2012
Walltime.Remaining = 1784
start_count = 1

It appears that the exit status is 0 for one but not for the other.

From 'qstat -f' outputs like the ones above, it seems that the problem has something to do with a 'Stale NFS file handle' in the post-job file processing. By submitting hundreds of test jobs, I have been able to identify a number of nodes that produce failed jobs. By sshing onto these, I can find the missing PBS output files in /var/spool/torque/spool, where I can also see output files belonging to other users. One strange thing about these problematic nodes is that if one of them is the only node chosen to be used, the job runs fine on it. The problem only arises when they are mixed with other nodes.
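
As a side note, the node-by-node testing can be scripted in the same style as the submission script above; a minimal sketch, assuming the node names are known (the node names, queue, and marker-file convention here are placeholders):

import os

# Hypothetical list of compute node names to probe; replace with the
# actual node names on the cluster.
nodes = ['node%02d' % i for i in range(1, 41)]

for node in nodes:
    jobname = 'probe_%s' % node
    f = open(jobname + '.sub', 'w')
    f.writelines([ '#!/bin/bash\n',
                   '#PBS -N %s\n' % jobname,
                   '#PBS -q compute\n',
                   '#PBS -j oe\n',
                   '#PBS -l nodes=%s:ppn=1\n' % node,    # pin the probe job to one specific node
                   '#PBS -l walltime=0:05:00\n',
                   'cd $PBS_O_WORKDIR\n',
                   'echo ok > %s.marker\n' % jobname ])  # a node whose marker file never appears is suspect
    f.close()
    os.system('qsub %s' % (jobname + '.sub'))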

Since I do not know how to fix the post-job file processing 'Stale NFS file handle' error, I avoid these nodes by submitting 'dummy' jobs to them

echo sleep 60 | qsub -lnodes=badnode1:ppn=2+badnode2:ppn=2

before submitting the real jobs. Now all jobs produce output as expected, and there is no need to wait before consecutive submissions.
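
For reference, a minimal sketch of how this dummy submission could be driven from the same Python script that submits the real jobs (the node names and ppn count are placeholders; the qsub command mirrors the one above):

import os

# Hypothetical list of nodes known to lose output files.
bad_nodes = ['badnode1', 'badnode2']

# Occupy every core of each bad node with a harmless sleep job,
# so that the real jobs get scheduled onto the remaining nodes.
node_spec = '+'.join('%s:ppn=2' % n for n in bad_nodes)
os.system('echo sleep 60 | qsub -l nodes=%s' % node_spec)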

Answer

I see two issues in the tracejob output from the failed job.

The first is Exit_status=135. This exit status is not a Torque error code but the exit status returned by the script itself, i.e. x_analyse.py. Python does not have a convention on the use of the sys.exit() function, and the source of the 135 code might be in one of the modules used in the script.
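
To illustrate the point, here is a sketch of how an exit code raised inside the analysis script would surface as the job's Exit_status; this wrapper structure is an assumption, not the actual content of x_analyse.py:

#!/usr/bin/env python
# Hypothetical wrapper structure for x_analyse.py: whatever integer is
# passed to sys.exit() is what Torque reports as the job's Exit_status.
import sys
import traceback

def analyse(datafile):
    pass   # placeholder for the real analysis

if __name__ == '__main__':
    try:
        analyse(sys.argv[1])
    except Exception:
        traceback.print_exc()   # with '#PBS -j oe' this lands in the job's .out file
        sys.exit(135)           # a nonzero value here is what tracejob shows as Exit_status
    sys.exit(0)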

The second issue is the failure of post-job file processing. This might indicate a misconfigured node.

From now on I am guessing. Since a successful job takes about 00:00:16, it is probably true that with a delay of 50 seconds all your jobs land on the first available node. With a smaller delay you get more nodes involved and eventually hit a misconfigured node, or get two scripts executing concurrently on a single node. I would modify the submit script by adding the line

  'echo $PBS_JOBID :: $PBS_O_HOST >> debug.log',

to the Python script that generates the .sub file. This would add the names of the execution hosts to debug.log, which would reside on a common filesystem if I understood your setup correctly.
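
Concretely, in the submission script from the question the extra line would slot into the writelines() call like this (only the debug line is new; the rest is unchanged from the question):

    file.writelines( [ '#!/bin/bash\n',
                       '#PBS -N %s\n' % jobname,
                       '#PBS -o %s\n' % (jobname + '.out'),
                       '#PBS -q compute\n',
                       '#PBS -j oe\n',
                       '#PBS -l nodes=1:ppn=1\n',
                       '#PBS -l walltime=5:00:00\n',
                       'cd $PBS_O_WORKDIR\n',
                       'echo $PBS_JOBID :: $PBS_O_HOST >> debug.log\n',   # the extra debug line suggested above
                       '\n',
                       './x_analyse.py %s\n' % (datafilename + '.pkl') ] )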

Then you (or the Torque admin) might want to look for the unprocessed output files in the MOM spool directory on the failing node to get some info for further diagnosis.
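
For example, run directly on a suspect node, a small sketch like the following could list what is left in the spool (the spool path is the one the question mentions; it may differ on other installations):

import glob
import os

# Torque MOM spool directory, as mentioned in the question;
# adjust the path if your installation differs.
spool_dir = '/var/spool/torque/spool'

# List whatever is sitting in the spool, newest first, to spot output
# files that were never copied back to the submission host.
for path in sorted(glob.glob(os.path.join(spool_dir, '*')),
                   key=os.path.getmtime, reverse=True):
    print('%s  %d bytes' % (path, os.path.getsize(path)))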
