Snakemake-从输入文件动态导出目标 [英] Snakemake - dynamically derive the targets from input files

查看：218 发布时间：2020/9/21 3:23:30 python python-3.x bioinformatics snakemake

本文介绍了Snakemake-从输入文件动态导出目标的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有很多这样的输入文件:

I have a large number of input files organized like this:

data/
├── set1/
│   ├── file1_R1.fq.gz
│   ├── file1_R2.fq.gz
│   ├── file2_R1.fq.gz
│   ├── file2_R2.fq.gz
|   :
│   └── fileX_R2.fq.gz
├── another_set/
│   ├── asdf1_R1.fq.gz
│   ├── asdf1_R2.fq.gz
│   ├── asdf2_R1.fq.gz
│   ├── asdf2_R2.fq.gz
|   :
│   └── asdfX_R2.fq.gz
:   
└── many_more_sets/
    ├── zxcv1_R1.fq.gz
    ├── zxcv1_R2.fq.gz
    :
    └── zxcvX_R2.fq.gz

如果您熟悉生物信息学-这些当然是配对末端测序运行中的fastq文件.我正在尝试生成一个读取所有这些内容的snakemake工作流程，但我在第一条规则上已经失败了.这是我的尝试:

If you are familiar with bioinformatics - these are of course fastq files from paired end sequencing runs. I am trying to generate a snakemake workflow that reads all of those and I'm already failing at the first rule. This is my attempt:

configfile: "config.yaml"

rule all:
    input:
        read1=expand("{output}/clipped_and_trimmed_reads/{{sample}}_R1.fq.gz", output=config["output"]),
        read2=expand("{output}/clipped_and_trimmed_reads/{{sample}}_R2.fq.gz", output=config["output"])

rule clip_and_trim_reads:
    input:
        read1=expand("{data}/{set}/{{sample}}_R1.fq.gz", data=config["raw_data"], set=config["sets"]),
        read2=expand("{data}/{set}/{{sample}}_R2.fq.gz", data=config["raw_data"], set=config["sets"])
    output:
        read1=expand("{output}/clipped_and_trimmed_reads/{{sample}}_R1.fq.gz", output=config["output"]),
        read2=expand("{output}/clipped_and_trimmed_reads/{{sample}}_R2.fq.gz", output=config["output"])
    threads: 16
    shell:
        """
        someTool -o {output.read1} -p {output.read2} \
        {input.read1} {input.read2}
        """

我无法将clip_and_trim_reads指定为目标，因为Target rules may not contain wildcards.我尝试添加all规则，但这给了我另一个错误:

I cannot specify clip_and_trim_reads as a target, because Target rules may not contain wildcards. I tried adding the all rule, but this gives me another error:

$ snakemake -np
Building DAG of jobs...
WildcardError in line 3 of /work/project/Snakefile:
Wildcards in input files cannot be determined from output files:
'sample'

我还尝试对all规则使用dynamic()函数，该函数确实确实找到了文件，但仍然给了我这个错误:

I also tried using the dynamic() function for the all rule, which weirdly did find the files, but still gave me this error:

$ snakemake -np
Dynamic output is deprecated in favor of checkpoints (see docs). It will be removed in Snakemake 6.0.
Building DAG of jobs...
MissingInputException in line 7 of /work/project/ladsie_002/analyses/scripts/2019-08-02_splice_leader_HiC/Snakefile:
Missing input files for rule clip_and_trim_reads:
data/raw_data/set1/__snakemake_dynamic___R1.fq.gz
data/raw_data/set1/__snakemake_dynamic___R2.fq.gz
data/raw_data/set1/__snakemake_dynamic___R2.fq.gz
data/raw_data/set1/__snakemake_dynamic___R1.fq.gz
[...]

我有一百多个不同的文件，因此我非常想避免创建包含所有文件名的列表.有什么想法可以实现这一目标吗?

I have over a hundred different files, so I would very much like to avoid creating a list with all filenames. Any ideas how to achieve that?

推荐答案

我认为您误会了蛇形设计的工作原理.当您运行snakemake时，您可以在命令行上定义所需的输出，否则将生成Snakefile中的第一个规则(您的所有规则)的输入.由于您未指定任何输出(snakemake -np)，Snakemake将尝试生成规则all的输入.

I think you misunderstand how snakemake works. When you run snakemake you define which output you want, either on the command-line, or else the input of the first rule in the Snakefile (your rule all) will be generated. Since you do not specify any output (snakemake -np), Snakemake will try to generate the input of rule all.

您的规则输入基本上都是:

The input of your rule all basically is:

"somepath/clipped_and_trimmed_reads/{sample}_R1.fq.gz"

不幸的是，Snakemake不知道如何从中生成输出...我们需要告诉Snakemake要使用哪些文件.我们可以手动执行此操作:

Snakemake, unfortunately, does not know how to generate output from this... We need to tell Snakemake which files to use. We can do this manually:

rule all:
    input:
        "somepath/clipped_and_trimmed_reads/file1_R1.fq.gz",
        "somepath/clipped_and_trimmed_reads/asdf1_R1.fq.gz",
        "somepath/clipped_and_trimmed_reads/zxcv1_R1.fq.gz"

但是随着我们获得更多文件，这变得非常麻烦，而且正如您在问题中所指定的，这不是您想要的.我们需要编写一个小的函数来为我们获取所有文件名.

But this becomes quite cumbersome as we get more files, and as you specify in the question, is not what you want. We need to write a small function that that gets all the filenames for us.

import glob
import re

data=config["raw_data"]
samples = []
locations = {}
for file in glob.glob(data + "/**", recursive=True):
    if '_R1.fq.gz' in file:
        split = re.split('/|_R1', file)
        filename, directory = split[-2], split[-3]
        locations[filename] = directory  # we will need this one later
        samples.append(filename)

我们现在可以将其全部输入规则:

We can now feed this to our rule all:

rule all:
    input:
        read1=expand("{output}/clipped_and_trimmed_reads/{sample}_R1.fq.gz", output=config["output"], sample=samples),
        read2=expand("{output}/clipped_and_trimmed_reads/{sample}_R2.fq.gz", output=config["output"], sample=samples)

请注意，我们不再拥有样本作为通配符，但是我们将其扩展"到了read1和read2中，从而实现了输出和样本的所有可能组合.

Note that we do not have sample as a wildcard anymore, but we 'expand' it into our read1 and read2, making all possible combinations of output and sample.

但是，我们只完成了一半.如果我们这样称呼Snakemake，它将确切地知道我们想要哪个输出，以及哪个规则可以生成此输出(规则clip_and_trim_reads).但是，它仍然不知道在哪里查找这些文件.幸运的是，我们已经有了一个字典，用于存储这些字典(存储在位置中).

We are only halfway done however.. If we call Snakemake like this, it will know exactly which output we want, and which rule can generate this (rule clip_and_trim_reads). However it still does not know where to look for these files. Fortunately we already have a dictionary where we stored these (stored in locations).

rule clip_and_trim_reads:
    input:
        read1=lambda wildcards: expand("{data}/{set}/{sample}_R1.fq.gz", data=config["raw_data"], set=locations[wildcards.sample], sample=wildcards.sample),
        read2=lambda wildcards: expand("{data}/{set}/{sample}_R2.fq.gz", data=config["raw_data"], set=locations[wildcards.sample], sample=wildcards.sample)
    output:
        ...

现在一切都应该正常！！甚至更好；由于我们将来自clip_and_trim_reads规则的所有结果都写入单个文件夹，因此从此处继续操作应该容易得多！

Now everything should work!! And even better; since all our results from the rule clip_and_trim_reads are written to a single folder, continuing from here should be much easier!

p.s.我没有测试任何代码，因此可能并非所有内容都能在第一次尝试时就可以正常工作.但是，该消息仍然存在.

p.s. I haven't tested any of the code, so probably not everything works on the first try. However the message remains.

这篇关于Snakemake-从输入文件动态导出目标的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

Snakemake-从输入文件动态导出目标 [英] Snakemake - dynamically derive the targets from input files

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

Snakemake-从输入文件动态导出目标 [英] Snakemake - dynamically derive the targets from input files

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭