Snakemake:将多个输入用于多个子组的一个输出的规则 [英] Snakemake: rule for using many inputs for one output with multiple sub-groups

查看：298 发布时间：2020/9/21 3:21:00 python bioinformatics snakemake

本文介绍了Snakemake:将多个输入用于多个子组的一个输出的规则的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个正在使用的管道，用于下载，对齐和对公共序列数据执行变体调用.问题在于它目前只能在每个样本的基础上工作(即样本是每个单独的测序实验).如果我要对一组实验(例如样品的生物学和/或技术重复)执行变体调用，则无法使用.我已经尝试解决它，但是我无法解决它.

I have a working pipeline I'm using for downloading, aligning and performing variant calling on public sequencing data. The problem is that it can currently only work on a per-sample basis (i.e sample as each individual sequencing experiment). It doesn't work if I want to perform variant calling on a group of experiments (such as biological and/or technical replicates of a sample). I've tried to solving it, but I couldn't get it work.

这是对齐规则的简化:

rule alignment:
    input:
        rules.download.output.fastq
    output:
        '{group}/alignment/{sample}.bam'
    shell:
        "bash scripts/02_alignment.sh {wildcards.group} {wildcards.samples}"

与变体调用相同:

rule variant_calling:
    input:
        rules.alignment.output
    output:
        '{group}/variants/{sample}.vcf.gz'
    shell:
        "bash scripts/03_variant_calling.sh {wildcards.sample} {wildcards.group}"

这很好用，因为为每个对齐的.bam文件生成了一个.vcf文件.但是我想做的是从任意数量的.bam文件生成一个.vcf文件.我有一个pandas数据框，其中包含所有sample名称及其对应的group.我本质上想将第二个规则的output更改为'{group}/variants/{group}.vcf'，但是我所做的一切都以某种方式失败了.

This works just fine, as there is a single .vcf file generated for each aligned .bam file. But what I would like to do is to generate a single .vcf file from an arbitrary number of .bam files. I have a pandas dataframe containing all the sample names and their corresponding group. I would essentially like to change the output of the second rule to be '{group}/variants/{group}.vcf', but everything I've done has failed in some way.

我的想法是为规则提供所有按组对齐的.bam文件作为输入，然后只给脚本提供运行它们所在目录的脚本.问题是我找不到以这种按组方式输入的方法:要么按样本(作为工作管道)给出，要么给所有.bam文件以进行EACH组变体调用，无论他们实际上属于哪个组.我不能只使用通配符，因为最后输出中没有{sample}通配符.我也尝试使用函数作为输入，但这会导致与上述相同的问题.

The idea I had was to provide the rule with all the per-group aligned .bam files as input and then just give the script it runs the directory where they all lie. The problem is that I can't find a way to give an input in this per-group manner: either I give per-sample (as the working pipeline) or I give ALL the .bam files for EACH group variant calling, regardless of what group they actually belong to. I can't use just wildcards, as the {sample} wildcard is not present in the last output. I also tried using functions as input, but that lead to the same problems as above.

问题的症结似乎是分组的层次:如果我想对整个数据集中的所有对齐的.bam文件执行变体调用，那可能会很好地工作，请解决我上面提到的问题.问题来自整个数据集的子组:

The crux of the matter seem to be the layers of grouping: if I wanted to perform variant calling on all aligned .bam files in the dataset as a whole, that would probably work fine, give my above mentioned problems. The problem comes with sub-groups for the dataset as a whole:

  sample1      sample2             sample1      sample2      sample3
     |            |                   |            |            |
     |            |                   |            |            |
     --------------                   ---------------------------
            |                                      |
            |                                      |
          group1                                 group2

关于如何解决此问题的任何想法?

Any ideas on how to solve this?

推荐答案

您必须使用某种结构将样本分组保存:

You have to use some kind of structure to keep your samples in groups:

GROUPS = {
    "group1":["sample1","sample2"],
    "group2":["sample1","sample2","sample3"]
}

然后是这样的:

rule all:
    input:
         expand("{group}/variants/{group}.vcf.gz", group=list(GROUPS.keys()))

rule alignment:
    input:
        rules.download.output.fastq
    output:
        '{group}/alignment/{sample}.bam'
    shell:
        "bash scripts/02_alignment.sh {wildcards.group} {wildcards.samples}"

rule variant_calling:
    input:
        lambda wildcards: expand("{group}/alignment/{sample}.bam", group=wildcards.group, sample=GROUPS[wildcards.group]
    output:
        '{group}/variants/{group}.vcf.gz'
    shell:
        "bash scripts/03_variant_calling.sh {input} {output}"

当然有您没有显示的遗漏规则，但我想您会明白的！

of course there are missing rules that you didn't show but I think you'll get the point !

规则variant_calling中的shell命令可能很难处理，但是您始终可以将目录定义为以下参数:

The shell command from rule variant_calling might be hard to handle but you can alway define a directory as params like:

params: groupAlignDir = "{group}/alignment"

并在外壳中使用它:

"bash scripts/03_variant_calling.sh {params.groupAlignDir} {output}"

并在脚本"variant_calling.sh"的目录中获取所有bam文件

and get all the bam file in the directory in your script "variant_calling.sh"

这篇关于Snakemake:将多个输入用于多个子组的一个输出的规则的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

Snakemake:将多个输入用于多个子组的一个输出的规则 [英] Snakemake: rule for using many inputs for one output with multiple sub-groups

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

Snakemake:将多个输入用于多个子组的一个输出的规则 [英] Snakemake: rule for using many inputs for one output with multiple sub-groups

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭