Snakemake:将多个输入用于多个子组的一个输出的规则 [英] Snakemake: rule for using many inputs for one output with multiple sub-groups
问题描述
我有一个正在使用的管道,用于下载,对齐和对公共序列数据执行变体调用.问题在于它目前只能在每个样本的基础上工作(即样本是每个单独的测序实验).如果我要对一组实验(例如样品的生物学和/或技术重复)执行变体调用,则无法使用.我已经尝试解决它,但是我无法解决它.
I have a working pipeline I'm using for downloading, aligning and performing variant calling on public sequencing data. The problem is that it can currently only work on a per-sample basis (i.e sample as each individual sequencing experiment). It doesn't work if I want to perform variant calling on a group of experiments (such as biological and/or technical replicates of a sample). I've tried to solving it, but I couldn't get it work.
这是对齐规则的简化:
rule alignment:
input:
rules.download.output.fastq
output:
'{group}/alignment/{sample}.bam'
shell:
"bash scripts/02_alignment.sh {wildcards.group} {wildcards.samples}"
与变体调用相同:
rule variant_calling:
input:
rules.alignment.output
output:
'{group}/variants/{sample}.vcf.gz'
shell:
"bash scripts/03_variant_calling.sh {wildcards.sample} {wildcards.group}"
这很好用,因为为每个对齐的.bam
文件生成了一个.vcf
文件.但是我想做的是从任意数量的.bam
文件生成一个.vcf
文件.我有一个pandas
数据框,其中包含所有sample
名称及其对应的group
.我本质上想将第二个规则的output
更改为'{group}/variants/{group}.vcf'
,但是我所做的一切都以某种方式失败了.
This works just fine, as there is a single .vcf
file generated for each aligned .bam
file. But what I would like to do is to generate a single .vcf
file from an arbitrary number of .bam
files. I have a pandas
dataframe containing all the sample
names and their corresponding group
. I would essentially like to change the output
of the second rule to be '{group}/variants/{group}.vcf'
, but everything I've done has failed in some way.
我的想法是为规则提供所有按组对齐的.bam
文件作为输入,然后只给脚本提供运行它们所在目录的脚本.问题是我找不到以这种按组方式输入的方法:要么按样本(作为工作管道)给出,要么给所有.bam
文件以进行EACH组变体调用,无论他们实际上属于哪个组.我不能只使用通配符,因为最后输出中没有{sample}
通配符.我也尝试使用函数作为输入,但这会导致与上述相同的问题.
The idea I had was to provide the rule with all the per-group aligned .bam
files as input and then just give the script it runs the directory where they all lie. The problem is that I can't find a way to give an input in this per-group manner: either I give per-sample (as the working pipeline) or I give ALL the .bam
files for EACH group variant calling, regardless of what group they actually belong to. I can't use just wildcards, as the {sample}
wildcard is not present in the last output. I also tried using functions as input, but that lead to the same problems as above.
问题的症结似乎是分组的层次:如果我想对整个数据集中的所有对齐的.bam
文件执行变体调用,那可能会很好地工作,请解决我上面提到的问题.问题来自整个数据集的子组:
The crux of the matter seem to be the layers of grouping: if I wanted to perform variant calling on all aligned .bam
files in the dataset as a whole, that would probably work fine, give my above mentioned problems. The problem comes with sub-groups for the dataset as a whole:
sample1 sample2 sample1 sample2 sample3
| | | | |
| | | | |
-------------- ---------------------------
| |
| |
group1 group2
关于如何解决此问题的任何想法?
Any ideas on how to solve this?
推荐答案
您必须使用某种结构将样本分组保存:
You have to use some kind of structure to keep your samples in groups:
GROUPS = {
"group1":["sample1","sample2"],
"group2":["sample1","sample2","sample3"]
}
然后是这样的:
rule all:
input:
expand("{group}/variants/{group}.vcf.gz", group=list(GROUPS.keys()))
rule alignment:
input:
rules.download.output.fastq
output:
'{group}/alignment/{sample}.bam'
shell:
"bash scripts/02_alignment.sh {wildcards.group} {wildcards.samples}"
rule variant_calling:
input:
lambda wildcards: expand("{group}/alignment/{sample}.bam", group=wildcards.group, sample=GROUPS[wildcards.group]
output:
'{group}/variants/{group}.vcf.gz'
shell:
"bash scripts/03_variant_calling.sh {input} {output}"
当然有您没有显示的遗漏规则,但我想您会明白的!
of course there are missing rules that you didn't show but I think you'll get the point !
规则variant_calling中的shell命令可能很难处理,但是您始终可以将目录定义为以下参数:
The shell command from rule variant_calling might be hard to handle but you can alway define a directory as params like:
params: groupAlignDir = "{group}/alignment"
并在外壳中使用它:
"bash scripts/03_variant_calling.sh {params.groupAlignDir} {output}"
并在脚本"variant_calling.sh"的目录中获取所有bam文件
and get all the bam file in the directory in your script "variant_calling.sh"
这篇关于Snakemake:将多个输入用于多个子组的一个输出的规则的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!