Can a snakemake input rule be defined with different paths/wildcards?


Question

I want to know if one can define an input rule that has dependencies on different wildcards.

To elaborate, I am running this Snakemake pipeline on different fastq files using qsub, which submits each job to a different node:

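  1. fastqc on the raw fastq: no downstream dependencies on other jobs
  2. adapter/quality trimming to generate the trimmed fastq
  3. fastqc_after on the trimmed fastq (output of step 2): no downstream dependencies
  4. star-rsem pipeline on the trimmed fastq (output of step 2 above)
  5. rsem and tximport (output of step 4)
  6. run multiqc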

MultiQC (https://multiqc.info/) runs on the results folder, which has results from fastqc, star, rsem, etc. However, because each job runs on a different node, sometimes the fastqc steps (fastqc and/or fastqc_after, Steps 1 and 3) are still running on the nodes while the other steps (Steps 2, 4 and 5) have finished, or vice versa.

Currently, I can create a MultiQC rule that waits on results from Steps 2, 4 and 5, because they are linked to each other by input/output rules.

I have attached my pipeline as png to this post. Any suggestions would help.

What I need: I want to create a "collating" step in which MultiQC waits until all steps (from 1 to 5) finish. In other words, using my attached png as a guide, I want to define multiple input rules for MultiQC so that it also waits on the results from fastqc.

Thanks.

Note: Based on comments I received from 'colin' and 'bli' after my original post, I have shared the code for the different rules here.

Step 1 - fastqc

rule fastqc:
    input:  "raw_fastq/{sample}.fastq"
    output: "results/fastqc/{sample}_fastqc.zip"
    log: "results/logs/fq_before/{sample}.fastqc.log"
    params: ...
    shell: ...

Step 2 - bbduk

rule bbduk:
    input: R1 = "raw_fastq/{sample}.fastq"
    output: R1 = "results/bbduk/{sample}_trimmed.fastq",
    params: ...
    log: "results/logs/bbduk/{sample}.bbduk.log"
    priority:95
    shell: ....

Step 3 - fastqc_after

rule fastqc_after:
    input:  "results/bbduk/{sample}_trimmed.fastq"
    output: "results/bbduk/{sample}_trimmed_fastqc.zip"
    log: "results/logs/fq_after/{sample}_trimmed.fastqc.log"
    priority: 70
    params: ...
    shell: ...

Step 4 - star_align

rule star_align:
    input: R1 = "results/bbduk/{sample}_trimmed.fastq"
    output:
        out_1 = "results/bam/{sample}_Aligned.toTranscriptome.out.bam",
        out_2 = "results/bam/{sample}_ReadsPerGene.out.tab"
    params: ...
    log: "results/logs/star/{sample}.star.log"
    priority:90
    shell: ...

Step 5 - rsem_norm

rule rsem_norm:
    input:
        bam = "results/bam/{sample}_Aligned.toTranscriptome.out.bam"
    output:
        genes = "results/quant/{sample}.genes.results"
    params: ...
    threads: 16
    priority:85
    shell: ...

Step 6 - rsem_model

rule rsem_model:
    input: "results/quant/{sample}.genes.results"
    output: "results/quant/{sample}_diagnostic.pdf"
    params: ...      
    shell: ...

Step 7 - tximport_rsem

rule tximport_rsem:
    input: expand("results/quant/{sample}_diagnostic.pdf", sample=samples)
    output: "results/rsem_tximport/RSEM_GeneLevel_Summarization.csv"
    shell: ...

Step 8 - multiqc

rule multiqc:
    input: expand("results/quant/{sample}.genes.results",sample=samples)
    output: "results/multiqc/project_QS_STAR_RSEM_trial.html"
    log: "results/log/multiqc"
    shell: ...
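
As posted, multiqc lists only the rsem_norm outputs as input, so it is scheduled after whatever those files depend on and nothing else. A minimal plain-Python sketch of that reasoning follows. It is not Snakemake's own scheduler: the `DEPENDS_ON` table is transcribed by hand from the rules above, and `upstream` is a hypothetical helper.

```python
# Rule -> rules whose outputs it consumes, transcribed by hand from the
# rules above. This is an illustration, not Snakemake's own DAG code.
DEPENDS_ON = {
    "fastqc": [],
    "bbduk": [],
    "fastqc_after": ["bbduk"],
    "star_align": ["bbduk"],
    "rsem_norm": ["star_align"],
    "rsem_model": ["rsem_norm"],
    "tximport_rsem": ["rsem_model"],
    "multiqc": ["rsem_norm"],
}


def upstream(rule, graph=DEPENDS_ON):
    """Return every rule that must finish before `rule` can start."""
    seen = set()
    stack = list(graph[rule])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph[current])
    return seen


waited_for = upstream("multiqc")
not_waited_for = set(DEPENDS_ON) - waited_for - {"multiqc"}

print(sorted(waited_for))      # ['bbduk', 'rsem_norm', 'star_align']
print(sorted(not_waited_for))  # ['fastqc', 'fastqc_after', 'rsem_model', 'tximport_rsem']
```

Both fastqc and fastqc_after end up in the second list, which matches the behaviour described: nothing forces them to finish before multiqc starts.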

Recommended answer

If you want rule multiqc to happen only after fastqc completed, you can add the output of fastqc to the input of multiqc:

rule multiqc:
    input:
        expand("results/quant/{sample}.genes.results",sample=samples),
        expand("results/fastqc/{sample}_fastqc.zip", sample=samples)
    output: "results/multiqc/project_QS_STAR_RSEM_trial.html"
    log: "results/log/multiqc"
    shell: ...

Or, if you need to be able to refer to the output of rsem_norm in your shell section:

rule multiqc:
    input:
        rsem_out = expand("results/quant/{sample}.genes.results",sample=samples),
        fastqc_out = expand("results/fastqc/{sample}_fastqc.zip", sample=samples)
    output: "results/multiqc/project_QS_STAR_RSEM_trial.html"
    log: "results/log/multiqc"
    shell: "... {input.rsem_out} ..."
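
The question asks for multiqc to wait on all of steps 1 to 5, and the same pattern should extend to the fastqc_after outputs as well. Here is a rough sketch in plain Python of the file lists such a rule would wait on. The sample names, `MULTIQC_INPUT_PATTERNS` and `multiqc_inputs` are hypothetical names introduced for illustration; the path patterns are copied from the rules in the question.

```python
# Hypothetical sample names; in the real Snakefile `samples` is already defined.
samples = ["sampleA", "sampleB"]

# Output patterns copied from rules fastqc, fastqc_after and rsem_norm.
MULTIQC_INPUT_PATTERNS = {
    "fastqc_out": "results/fastqc/{sample}_fastqc.zip",
    "fastqc_after_out": "results/bbduk/{sample}_trimmed_fastqc.zip",
    "rsem_out": "results/quant/{sample}.genes.results",
}


def multiqc_inputs(sample_names):
    """Fill every pattern with every sample, grouped by input name."""
    return {
        name: [pattern.format(sample=s) for s in sample_names]
        for name, pattern in MULTIQC_INPUT_PATTERNS.items()
    }


for name, paths in multiqc_inputs(samples).items():
    print(name, paths)
```

In the Snakefile itself, the equivalent would presumably be one more named input next to rsem_out and fastqc_out, for example fastqc_after_out = expand("results/bbduk/{sample}_trimmed_fastqc.zip", sample=samples).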

In one of your comments, you wrote:

MultiQC needs directory as input - I give it the 'results' directory in my shell command.

If I understand correctly, this means that results/quant/{sample}.genes.results are directories, and not plain files. If this is the case, you should make sure no downstream rule writes files inside those directories. Otherwise, the directories will be considered as having been updated after the output of multiqc, and multiqc will be re-run every time you run the pipeline.
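
The timestamp effect behind that warning can be reproduced outside Snakemake. The sketch below only illustrates how most filesystems treat a directory's modification time when a file is created inside it; it makes no claim about Snakemake's internal checks, and the directory and file names are hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # Hypothetical directory standing in for a directory-type input.
    input_dir = os.path.join(workdir, "quant_dir")
    os.mkdir(input_dir)

    # Backdate the directory so the change is visible regardless of
    # filesystem timestamp resolution.
    old_time = 1_000_000_000  # an arbitrary moment in 2001
    os.utime(input_dir, (old_time, old_time))
    mtime_before = os.path.getmtime(input_dir)

    # A "downstream rule" writes a new file inside the directory.
    with open(os.path.join(input_dir, "extra.txt"), "w") as handle:
        handle.write("written later\n")

    mtime_after = os.path.getmtime(input_dir)
    directory_looks_newer = mtime_after > mtime_before
    print(directory_looks_newer)
```

If the directory looks newer than the multiqc report after such a write, a timestamp-based comparison would treat the report as out of date.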
