Snakemake:将多个输入用于多个子组的一个输出的规则 [英] Snakemake: rule for using many inputs for one output with multiple sub-groups

查看:298
本文介绍了Snakemake:将多个输入用于多个子组的一个输出的规则的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个正在使用的管道,用于下载,对齐和对公共序列数据执行变体调用.问题在于它目前只能在每个样本的基础上工作(样本是每个单独的测序实验).如果我要对一组实验(例如样品的生物学和/或技术重复)执行变体调用,则无法使用.我已经尝试解决它,但是我无法解决它.

I have a working pipeline I'm using for downloading, aligning and performing variant calling on public sequencing data. The problem is that it can currently only work on a per-sample basis (i.e sample as each individual sequencing experiment). It doesn't work if I want to perform variant calling on a group of experiments (such as biological and/or technical replicates of a sample). I've tried to solving it, but I couldn't get it work.

这是对齐规则的简化:

rule alignment:
    input:
        rules.download.output.fastq
    output:
        '{group}/alignment/{sample}.bam'
    shell:
        "bash scripts/02_alignment.sh {wildcards.group} {wildcards.samples}"

与变体调用相同:

rule variant_calling:
    input:
        rules.alignment.output
    output:
        '{group}/variants/{sample}.vcf.gz'
    shell:
        "bash scripts/03_variant_calling.sh {wildcards.sample} {wildcards.group}"

这很好用,因为为每个对齐的.bam文件生成了一个.vcf文件.但是我想做的是从任意数量的.bam文件生成一个.vcf文件.我有一个pandas数据框,其中包含所有sample名称及其对应的group.我本质上想将第二个规则的output更改为'{group}/variants/{group}.vcf',但是我所做的一切都以某种方式失败了.

This works just fine, as there is a single .vcf file generated for each aligned .bam file. But what I would like to do is to generate a single .vcf file from an arbitrary number of .bam files. I have a pandas dataframe containing all the sample names and their corresponding group. I would essentially like to change the output of the second rule to be '{group}/variants/{group}.vcf', but everything I've done has failed in some way.

我的想法是为规则提供所有按组对齐的.bam文件作为输入,然后只给脚本提供运行它们所在目录的脚本.问题是我找不到以这种按组方式输入的方法:要么按样本(作为工作管道)给出,要么给所有.bam文件以进行EACH组变体调用,无论他们实际上属于哪个组.我不能只使用通配符,因为最后输出中没有{sample}通配符.我也尝试使用函数作为输入,但这会导致与上述相同的问题.

The idea I had was to provide the rule with all the per-group aligned .bam files as input and then just give the script it runs the directory where they all lie. The problem is that I can't find a way to give an input in this per-group manner: either I give per-sample (as the working pipeline) or I give ALL the .bam files for EACH group variant calling, regardless of what group they actually belong to. I can't use just wildcards, as the {sample} wildcard is not present in the last output. I also tried using functions as input, but that lead to the same problems as above.

问题的症结似乎是分组的层次:如果我想对整个数据集中的所有对齐的.bam文件执行变体调用,那可能会很好地工作,请解决我上面提到的问题.问题来自整个数据集的子组:

The crux of the matter seem to be the layers of grouping: if I wanted to perform variant calling on all aligned .bam files in the dataset as a whole, that would probably work fine, give my above mentioned problems. The problem comes with sub-groups for the dataset as a whole:

  sample1      sample2             sample1      sample2      sample3
     |            |                   |            |            |
     |            |                   |            |            |
     --------------                   ---------------------------
            |                                      |
            |                                      |
          group1                                 group2

关于如何解决此问题的任何想法?

Any ideas on how to solve this?

推荐答案

您必须使用某种结构将样本分组保存:

You have to use some kind of structure to keep your samples in groups:

GROUPS = {
    "group1":["sample1","sample2"],
    "group2":["sample1","sample2","sample3"]
}

然后是这样的:

rule all:
    input:
         expand("{group}/variants/{group}.vcf.gz", group=list(GROUPS.keys()))

rule alignment:
    input:
        rules.download.output.fastq
    output:
        '{group}/alignment/{sample}.bam'
    shell:
        "bash scripts/02_alignment.sh {wildcards.group} {wildcards.samples}"

rule variant_calling:
    input:
        lambda wildcards: expand("{group}/alignment/{sample}.bam", group=wildcards.group, sample=GROUPS[wildcards.group]
    output:
        '{group}/variants/{group}.vcf.gz'
    shell:
        "bash scripts/03_variant_calling.sh {input} {output}"

当然有您没有显示的遗漏规则,但我想您会明白的!

of course there are missing rules that you didn't show but I think you'll get the point !

规则variant_calling中的shell命令可能很难处理,但是您始终可以将目录定义为以下参数:

The shell command from rule variant_calling might be hard to handle but you can alway define a directory as params like:

params: groupAlignDir = "{group}/alignment"

并在外壳中使用它:

"bash scripts/03_variant_calling.sh {params.groupAlignDir} {output}"

并在脚本"variant_calling.sh"的目录中获取所有bam文件

and get all the bam file in the directory in your script "variant_calling.sh"

这篇关于Snakemake:将多个输入用于多个子组的一个输出的规则的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆