Can a snakemake input rule be defined with different paths/wildcards
Problem description
I want to know if one can define an input rule that has dependencies on different wildcards.
To elaborate, I am running this Snakemake pipeline on different fastq files using qsub, which submits each job to a different node:
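1. fastqc on the original fastq (no downstream jobs depend on it)
2. adapter/quality trimming to generate the trimmed fastq
3. fastqc_after on the trimmed fastq (output of step 2), no downstream dependencies
4. star-rsem pipeline on the trimmed fastq (output of step 2 above)
5. rsem and tximport (output of step 4)
6. run multiqc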
MultiQC (https://multiqc.info/) runs on the results folder, which has results from fastqc, star, rsem, etc. However, because each job runs on a different node, sometimes fastqc and/or fastqc_after (Steps 1 and 3) are still running on their nodes while the other steps (Steps 2, 4 and 5) have finished, or vice versa.
Currently, I can create a MultiQC rule that waits on results from Steps 2, 4 and 5, because they are linked to each other by input/output rules.
I have attached my pipeline as png to this post. Any suggestions would help.
What I need: I want to create a "collating" step in which MultiQC waits until all steps (1 to 5) have finished. In other words, using my attached png as a guide, I want to define multiple inputs for the MultiQC rule so that it also waits on the results from fastqc.
Thanks.
Note: Based on comments I received from 'colin' and 'bli' after my original post, I have shared the code for the different rules here.
Step 1 - fastqc
rule fastqc:
    input: "raw_fastq/{sample}.fastq"
    output: "results/fastqc/{sample}_fastqc.zip"
    log: "results/logs/fq_before/{sample}.fastqc.log"
    params: ...
    shell: ...
Step 2 - bbduk
rule bbduk:
    input: R1 = "raw_fastq/{sample}.fastq"
    output: R1 = "results/bbduk/{sample}_trimmed.fastq",
    params: ...
    log: "results/logs/bbduk/{sample}.bbduk.log"
    priority: 95
    shell: ....
Step 3 - fastqc_after
rule fastqc_after:
    input: "results/bbduk/{sample}_trimmed.fastq"
    output: "results/bbduk/{sample}_trimmed_fastqc.zip"
    log: "results/logs/fq_after/{sample}_trimmed.fastqc.log"
    priority: 70
    params: ...
    shell: ...
Step 4 - star_align
rule star_align:
    input: R1 = "results/bbduk/{sample}_trimmed.fastq"
    output:
        out_1 = "results/bam/{sample}_Aligned.toTranscriptome.out.bam",
        out_2 = "results/bam/{sample}_ReadsPerGene.out.tab"
    params: ...
    log: "results/logs/star/{sample}.star.log"
    priority: 90
    shell: ...
Step 5 - rsem_norm
rule rsem_norm:
    input:
        bam = "results/bam/{sample}_Aligned.toTranscriptome.out.bam"
    output:
        genes = "results/quant/{sample}.genes.results"
    params: ...
    threads: 16
    priority: 85
    shell: ...
Step 6 - rsem_model
rule rsem_model:
    input: "results/quant/{sample}.genes.results"
    output: "results/quant/{sample}_diagnostic.pdf"
    params: ...
    shell: ...
Step 7 - tximport_rsem
rule tximport_rsem:
    input: expand("results/quant/{sample}_diagnostic.pdf", sample=samples)
    output: "results/rsem_tximport/RSEM_GeneLevel_Summarization.csv"
    shell: ...
Step 8 - multiqc
rule multiqc:
    input: expand("results/quant/{sample}.genes.results", sample=samples)
    output: "results/multiqc/project_QS_STAR_RSEM_trial.html"
    log: "results/log/multiqc"
    shell: ...
Recommended answer
If you want rule multiqc to run only after fastqc has completed, you can add the output of fastqc to the input of multiqc:
rule multiqc:
    input:
        expand("results/quant/{sample}.genes.results", sample=samples),
        expand("results/fastqc/{sample}_fastqc.zip", sample=samples)
    output: "results/multiqc/project_QS_STAR_RSEM_trial.html"
    log: "results/log/multiqc"
    shell: ...
Or, if you need to be able to refer to the output of rsem_norm in your shell section:
rule multiqc:
    input:
        rsem_out = expand("results/quant/{sample}.genes.results", sample=samples),
        fastqc_out = expand("results/fastqc/{sample}_fastqc.zip", sample=samples)
    output: "results/multiqc/project_QS_STAR_RSEM_trial.html"
    log: "results/log/multiqc"
    shell: "... {input.rsem_out} ..."
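The question also mentions fastqc_after, whose outputs sit under a different path than those of fastqc. One possible way to gather every upstream output into a single list is a small plain-Python helper, which could live at the top of the Snakefile. This is only a sketch: the names `PATTERNS` and `multiqc_inputs` are hypothetical, the path patterns are copied from the rules shown in the question, and `samples` is assumed to be a list of sample names, as the `expand(...)` calls suggest.

```python
# Hypothetical helper: plain Python, not part of the Snakemake API.
# Output patterns copied from the rules fastqc, fastqc_after and rsem_norm.
PATTERNS = [
    "results/fastqc/{sample}_fastqc.zip",
    "results/bbduk/{sample}_trimmed_fastqc.zip",
    "results/quant/{sample}.genes.results",
]


def multiqc_inputs(samples):
    """Return one flat list with every file MultiQC should wait for."""
    # Fill the {sample} placeholder of each pattern with each sample name.
    return [pattern.format(sample=s) for pattern in PATTERNS for s in samples]
```

With such a helper, rule multiqc could presumably declare `input: multiqc_inputs(samples)`, so that the fastqc, fastqc_after and rsem_norm jobs for all samples must finish before MultiQC starts.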
In one of your comments, you wrote:
MultiQC needs directory as input - I give it the 'results' directory in my shell command.
If I understand correctly, this means that results/quant/{sample}.genes.results are directories, and not plain files. If this is the case, you should make sure no downstream rule writes files inside those directories. Otherwise, the directories will be considered as having been updated after the output of multiqc was created, and multiqc will be re-run every time you run the pipeline.
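The timestamp effect behind this warning can be observed with plain Python. The following is a small sketch, assuming a filesystem where creating a file inside a directory updates that directory's modification time (the usual behaviour on Linux); the directory and file names are made up for illustration.

```python
import os
import tempfile


def dir_mtime_around_write():
    """Return (before, after) mtimes of a directory around a write inside it."""
    with tempfile.TemporaryDirectory() as tmp:
        out_dir = os.path.join(tmp, "sample.genes.results")  # made-up name
        os.mkdir(out_dir)
        # Pretend the directory was produced long ago (fixed old timestamp).
        os.utime(out_dir, (1_000_000_000, 1_000_000_000))
        before = os.stat(out_dir).st_mtime
        # A "downstream rule" now writes a new file inside the directory.
        with open(os.path.join(out_dir, "extra.txt"), "w") as handle:
            handle.write("written later\n")
        after = os.stat(out_dir).st_mtime
        return before, after


before, after = dir_mtime_around_write()
print("directory looks newer after the write:", after > before)
```

If the directory's new modification time is later than that of the MultiQC report, a timestamp-based check will see the input as newer than the output, which is the re-run scenario described above.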