Snakemake hangs when cluster (slurm) cancelled a job


Problem description

Maybe the answer is obvious for many, but I am quite surprised I could not find a question regarding this topic, which represents a major problem for me. I would greatly appreciate a hint!

When submitting a job on a cluster managed by slurm, if the queue manager cancels the job (e.g. for insufficient resources or time), snakemake seems not to receive any signal and hangs forever. On the other hand, when a job fails, snakemake also fails, as expected. Is this behavior normal/wanted? How can I make snakemake fail when a job gets cancelled as well? I had this problem with snakemake version 3.13.3, and it remains after updating to 5.3.0.

For example, in this case I launch a simple pipeline with insufficient resources for the rule pluto:

$ snakemake -j1 -p --cluster 'sbatch --mem {resources.mem}' pluto.txt
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 1
Unlimited resources: mem
Job counts:
    count   jobs
    1       pippo
    1       pluto
    2

[Tue Sep 25 16:04:21 2018]
rule pippo:
    output: pippo.txt
    jobid: 1
    resources: mem=1000

seq 1000000 | shuf > pippo.txt
Submitted job 1 with external jobid 'Submitted batch job 4776582'.
[Tue Sep 25 16:04:31 2018]
Finished job 1.
1 of 2 steps (50%) done

[Tue Sep 25 16:04:31 2018]
rule pluto:
    input: pippo.txt
    output: pluto.txt
    jobid: 0
    resources: mem=1

sort pippo.txt > pluto.txt
Submitted job 0 with external jobid 'Submitted batch job 4776583'.

Here it hangs. And here is the content of the job accounting:

$ sacct -S2018-09-25-16:04 -o jobid,JobName,state,ReqMem,MaxRSS,Start,End,Elapsed
       JobID    JobName      State     ReqMem     MaxRSS               Start                 End    Elapsed
------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ----------
4776582      snakejob.+  COMPLETED     1000Mn            2018-09-25T16:04:22 2018-09-25T16:04:27   00:00:05
4776582.bat+      batch  COMPLETED     1000Mn      1156K 2018-09-25T16:04:22 2018-09-25T16:04:27   00:00:05
4776583      snakejob.+ CANCELLED+        1Mn            2018-09-25T16:04:32 2018-09-25T16:04:32   00:00:00
4776583.bat+      batch  CANCELLED        1Mn      1156K 2018-09-25T16:04:32 2018-09-25T16:04:32   00:00:00

Answer

Snakemake doesn't recognize all kinds of job statuses in slurm (or in other job schedulers). To bridge this gap, snakemake provides the option --cluster-status, to which a custom Python script can be passed. As per snakemake's documentation:

 --cluster-status

Status command for cluster execution. This is only considered in combination with the --cluster flag.
If provided, Snakemake will use the status command to determine if a job has finished successfully or failed.
For this it is necessary that the submit command provided to --cluster returns the cluster job id.
Then, the status command will be invoked with the job id.
Snakemake expects it to return 'success' if the job was successful, 'failed' if the job failed and 'running' if the job still runs.

The example shown in snakemake's docs for using this feature:

#!/usr/bin/env python
import subprocess
import sys

jobid = sys.argv[1]

# Query slurm's accounting for the job's State (first record only)
output = subprocess.check_output(
    "sacct -j %s --format State --noheader | head -1 | awk '{print $1}'" % jobid,
    shell=True).decode().strip()

running_status = ["PENDING", "CONFIGURING", "COMPLETING", "RUNNING", "SUSPENDED"]
if "COMPLETED" in output:
    print("success")
elif any(r in output for r in running_status):
    print("running")
else:
    print("failed")

To use this script, call snakemake as below, where status.py is the script above:

$ snakemake all --cluster "sbatch --cpus-per-task=1 --parsable" --cluster-status ./status.py
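Note that the script matches states by substring, so a value such as `CANCELLED by 1234` (which sacct reports for cancelled jobs, shown truncated as `CANCELLED+` in the accounting output above) still maps to `failed`. As a sketch, that mapping can be factored into a pure function and checked without a cluster; the function name `classify` and this structure are my own, not part of the original answer:

```python
# Hypothetical refactor of the state-mapping logic from status.py above
# into a pure function, so it can be tested without access to slurm.

RUNNING_STATES = ["PENDING", "CONFIGURING", "COMPLETING", "RUNNING", "SUSPENDED"]

def classify(state):
    """Map a raw sacct State string to the word snakemake expects."""
    if "COMPLETED" in state:
        return "success"
    if any(r in state for r in RUNNING_STATES):
        return "running"
    # Anything else (FAILED, TIMEOUT, CANCELLED, "CANCELLED by <uid>", ...)
    # counts as failed, which is what makes snakemake stop instead of hang.
    return "failed"
```

For example, `classify("CANCELLED by 1234")` returns `"failed"`, which is exactly the case that otherwise leaves snakemake hanging.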

Alternatively, you may use pre-made custom scripts for several job schedulers (slurm, lsf, etc.), available via Snakemake-Profiles. Here is the one for slurm: slurm-status.py.
