How do I write a snakemake input when not all jobs successfully output files from previous rule?


Problem description

Basically, I have three snakemake rules (other than rule all) and cannot figure this problem out, despite the checkpoint resources.

Rule one has my first and only file that I start with. It will have x outputs (the number varies depending on the input file). Each of those x outputs needs to be processed separately in rule 2, meaning that rule 2 will run x jobs. However, only some subset, y, of these jobs will produce outputs (the software only writes out files for inputs that pass a certain threshold). So, while I want each of those outputs to run as a separate job in rule 3, I don't know how many files will come out of rule 2. Rule 3 will also run y jobs, one for each successful output from rule 2. I have two questions. The first is: how do I write the input for rule 3 without knowing how many files will come out of rule 2? The second is: how do I "tell" rule 2 it is done when the number of output files does not match the number of input files? If I add a fourth rule, I imagine it would try to re-run rule 2 on jobs that didn't get an output file, which would never make an output. Maybe I am missing something with setting up the checkpoints?

Something like:

rule a:
     input: "file.vcf"
     output: "dummy.txt"
     shell:"""
           # .... make unknown number of output files (x): x_1, x_2, ..., x_n
           """
#run a separate job from each output of rule a
rule b:
     input: "x_1" #not sure how many inputs there are going to be here
     output: "y_1" #not sure how many output files there will be here
     shell:"""
           # some of the x inputs will produce their corresponding y, but others will have no output
           """
#run a separate job for each output of rule b
rule c:
     input: "y_1" #not sure how many input files here
     output: "z_1"

Recommended answer

You should change rule a to a checkpoint as mentioned in the comments. Rule b will generate one output for each input and can be left as is, same for rule c in this example.

Eventually, you will have an aggregate-like rule that will decide which outputs are required. It could be rule d, or it may end up being rule all. Either way, the aggregate rule needs an input function that invokes the checkpoint to determine which files are present. If you follow along with the example, you would have something like:

import os  # needed for os.path.join in aggregate_input below

checkpoint a:
     input: "file.vcf"
     output: directory('output_dir')
     shell:"""
           mkdir {output}  # then put all the output files here!
           # .... make unknown number of output files (x): x_1, x_2, ..., x_n
           """
#run a separate job from each output of checkpoint a
rule b:
     input: "output_dir/x_{n}.txt"
     output: "y_{n}"
     shell:"""
           # some of the x inputs will produce their corresponding y, but others will have no output
           """
#run a separate job for each output of rule b
rule c:
     input: "y_{n}"
     output: "z_{n}"

# input function for the rule aggregate
def aggregate_input(wildcards):
    checkpoint_output = checkpoints.a.get(**wildcards).output[0]
    return expand("z_{i}",
           i=glob_wildcards(os.path.join(checkpoint_output, "x_{i}.txt")).i)

rule aggregate:  # what do you do with all the z files?  could be all
    input: aggregate_input

If you think of your workflow like a tree, rule a is branching with a variable number of branches. Rules b and c are one-to-one mappings. Aggregate brings all the branches back together AND is responsible for checking how many branches are present. Rules b and c only see one input/output and don't care how many other branches there are.
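As a concrete (and purely hypothetical) illustration of that last point, if the aggregation ends up being a dedicated rule d that merges the z files rather than the rule aggregate shown above, a minimal sketch could look like the following; the merged.txt name and the cat command are placeholders, not part of the original workflow:

# hypothetical aggregation step: merge whatever z files exist into one result
rule d:
    input: aggregate_input          # the input function defined above
    output: "merged.txt"            # placeholder name
    shell: "cat {input} > {output}"

# rule all then only needs to ask for the final merged file
rule all:
    input: "merged.txt"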

EDIT to answer the question in the comments and fix several bugs in my code:

I still get confused here though, because rule b will not have as many outputs as inputs, so won't rule aggregate never run until all of the wildcards from the output of checkpoint a are present in z_{n}, which they never would be?

This is confusing because it's not how snakemake usually works and leads to a lot of questions on SO. What you need to remember is that when checkpoints.<rule>.get is run, the evaluation for that step effectively pauses. Consider the simple case of three values for i == [1, 2, 3] but only i == 2 and 3 are created in checkpoint a. We know the DAG will look like this:

rule             file
input           file.vcf
             /     |     \
a                 x_2    x_3
                   |      |
b                 y_2    y_3
                   |      |
c                 z_2    z_3
                    \     /
aggregate           OUTPUT

Where x_1 is missing from checkpoint a. But snakemake doesn't know how checkpoint a will behave, just that it will make a directory as output and that (because it is a checkpoint) once it is completed the DAG will be reevaluated. So if you ran snakemake -nq you would see that checkpoint a and aggregate will run, but no mention of b or c. At that point, those are the only rules snakemake knows about and plans to run. Calling checkpoints.<rule>.get basically says "wait here; after this rule you will have to see what is made".

So when snakemake first starts running your workflow, the DAG looks like this:

rule             file
input           file.vcf
                   |     
a                 ...
                   |     
????              ...
                   |     
aggregate        OUTPUT

Snakemake doesn't know what goes between rule a and aggregate, just that it needs to run a before it can tell.

rule             file
input           file.vcf
             /     |     \
a                 x_2    x_3
                        
????              ...
                   |     
aggregate        OUTPUT

Checkpoint a gets scheduled, run, and now the DAG is reevaluated. The rest of aggregate_input looks at the files that are present with glob_wildcards then uses that information to decide which files it needs. Note that the expand is requesting outputs from rule c, which requires rule b, which requires x_{n}, which exist now that the checkpoint has run. Now snakemake can construct the DAG you expect.

Here's the input function with more comments to hopefully make it clear:

def aggregate_input(wildcards):
    # say this rule depends on a checkpoint.  DAG evaluation pauses here
    checkpoint_output = checkpoints.a.get(**wildcards).output[0]
    # at this point, checkpoint a has completed and the output (directory)
    # is in checkpoint_output.  Some number of files are there

    # use glob_wildcards to find the x_{i} files that actually exist
    found_files = glob_wildcards(os.path.join(checkpoint_output, "x_{i}.txt")).i
    # now we know we need all the z files to be created *if* an x file exists.
    return expand("z_{i}", i=found_files)
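To get a concrete feel for what this returns, suppose (hypothetically) that checkpoint a only produced x_2.txt and x_3.txt, matching the DAG drawn earlier. Inside the function you would then get roughly:

# hypothetical contents of output_dir after checkpoint a has run:
#   output_dir/x_2.txt
#   output_dir/x_3.txt
found_files = glob_wildcards("output_dir/x_{i}.txt").i   # ['2', '3']
expand("z_{i}", i=found_files)                            # ['z_2', 'z_3']

So the aggregate rule only ever asks for z_2 and z_3, and snakemake never tries to build a z_1 that could not exist.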
