snakemake,如何为两个独立参数构建循环 [英] snakemake, how to build a loop for two independent parameter

查看:159
本文介绍了snakemake,如何为两个独立参数构建循环的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想在两个不同的通配符之间循环使用snakemake,我认为它们互不相关.

I want to loop snakemake above two different wildcards, which - I think - are somehow independent from each other.

如果这种情况下已经存在解决的威胁,我很乐意提供提示.但是到目前为止,我不确定要使用什么正确的术语来查找我想做的事情.

In case that there is already a solved threat for this case, I would be happy for a hint. But so far I'm not sure what the correct terms are to look for what I want to do.

假设我的管道包含三个步骤.我在这三个步骤中的每个步骤中都有一组样本.在第二步中,我为每个样本部署了一个额外的参数.现在,在第三步中,我必须遍历样本和相关的参数. 由于这种结构,我认为不可能用字典结构来解决这个问题.

Let's assume my pipeline has three steps. I have a set of samples which I process in each of those three steps. Put in the second step I deploy an extra parameter to every sample. In the third step now I have to iterate through the samples and the associated parameter. Because of this structure, I think it's not possible to solve this with a dictionary structure.

为说明这种情况,我简化了文件和规则: 配置文件:

To picture the case I made this simplification of my files and rules: The config file:

config.yaml

samples:
- a
- b
- c
- d
threshhold:
- 0.5
- 0.1
- 0.2

我的蛇文件的计划.它描绘了snakemake操作的确切结构和命名.至少命名是简化的. (我在括号中添加了我实际使用的工具,但我认为这对于理解并不是必不可少的.)

A scheme of my snakefile. It pictures the exact structure and naming of snakemake operations. At least the naming is simplified. (I added the tools I actually use in brackets but I think that's not essential for understanding.)

rule all: 
    input: 
      expand("{sample}.bam", sample=config["samples"]),
      expand("{sample}_{param}.bed", sample=config["samples"], param=config["threshhold"])

rule first: # (samtools view)
    input: 
       "{sample}.sam"
    output:
       "{sample}.bam"
    shell:
       "<somecommand> {input} {output}"

rule second: # ( macs2 callpeaks; Of course, there are multiple outputs but they only vary in their suffix))
    input:
       "{sample}.bam"
    output:
       "{sample}_{param}.bed"
    params:
       out_name="{sample}",
       threshhold="{param}"
    shell:
       "<somecommand> {input} -n {params.names} -q {params.threshhold}"

所以现在我有一个看起来像这样的文件列表:

So now I have a list of files looking like this:

  • a_0.5.bed
  • a_0.1.bed
  • a_0.2.bed
  • b_0.5.bed
  • b_0.1.bed
  • b_0.2.bed
  • ...

在我的第三条规则中,我想对具有相同参数的不同样本进行交集.像: a_0.5.bed x b_0.5.bed c_0.5.bed x d_0.5.bed 并得到类似 ab_0.5的输出.bed ab_0.1.bed cd_0.5.bed ...

In my third rule I want to do intersections of different samples with the same param. Like: a_0.5.bed x b_0.5.bed and c_0.5.bed x d_0.5.bed and get the output like ab_0.5.bed, ab_0.1.bed, cd_0.5.bed...

我的第一次尝试是

rule all:
    input:
        expand("ab_{param}.bed", param=config["threshhold"])

rule intersect_2: # (bedtools intersect)
    input:
        a=expand("{sample_a}_{param}_peaks.narrowPeak", sample_a=config["samples"][0], param=config["threshhold"]),
        b=expand("{sample_b}_{param}_peaks.narrowPeak", sample_b=config["samples"][1], param=config["threshhold"])
    output:
        ab="intersect/ab_{param}.bed"
    params:
        threshhold="{param}"
    shell:
        "bedtools intersect -u -a {input.a} -b {input.b} > {output.ab}"

这不起作用,因为现在输入一次是所有不同的参数文件.

Well this doesn't work because now the input are all the different parameters files at once.

我想我在这里需要更多和不同的循环结构.甚至在规则周围还有一些额外的python循环之类的东西?但是由于我完全没有编程经验,而且我只是一步一步地进入这些东西,所以我现在还不知道该从哪里开始或需要哪个循环.

I think I need more and a different loop structure here. Maybe even some extra python loops around the rule or something? But since I have no programming experience at all and I'm just getting started to go into those things step by step, I couldn't figure out by now where to start or which loop is required for this.

摘要: 使用给定的配置文件,我要存档一个文件夹,其中充满了具有相同参数的样本的不同组合.因此,最终得到一个看起来像这样的列表:

Summary: With the given config file I'd like to archive a folder wich is filled with different combinations of the samples with the same parameter. So to end up with a list which looks like this:

  • ab_0.5.bed
  • ba_0.5.bed
  • cb_0.5.bed
  • ca_0.5.bed
  • abc_0.5.bed
  • bca_0.5.bed
  • cba_0.5.bed

还有其他所有参数的组合.

And those combinations too, for all the other parameters.

我将非常感谢您提供的任何帮助和每一个提示,这些帮助和提示有助于您理解我到底想在这里做什么以及如何构建它.

I would really appreciate any help and every hint that helps to understand what I exactly want to do there and how I can build this.

完全重组的配置文件可能会有所帮助吗?样本已经预先合并到哪里了?也许是这样的:(假设s1,s2等代表真实的(并且很长的)样本名称)

Would maybe a completely restructured config file would help? Where the samples are already pre-combined? Maybe like this: (let's assume that s1, s2 and so on are standing for the real (and long) sample name)

config.yaml
samples_combinations:
- s1 : s2
- s3 : s2
- s3 : s1

我仍然必须重命名... 但是我真的不喜欢这个主意.我的目标是构建一种易于应用且简单的方法,而无需进行过多的手动优化,尤其是因为在这种情况下,您可以拥有不止三个样本,而且必须以多种方式进行组合.

I still have to rename it... But I don't really like the idea. My aim is to build something which is easily applicable and straight forward without as much manual refinement, Especially because you can have way more than just three samples in this case which have to be combined in multiple ways.

推荐答案

我想您差不多了.您可以使用简单的通配符来实现所需的功能.请记住,expand用于创建列表.保持简单:

I think you almost got it. You can achieve what you want with simple wildcards. Remember that expand is used to create lists. Keep it simple:

rule all:
    input:
        expand("intersect/ab_{param}.bed", param=config["threshhold"])

rule intersect_2: # (bedtools intersect)
    input:
        a="{sample1}_{param}_peaks.narrowPeak",
        b="{sample2}_{param}_peaks.narrowPeak"
    output:
        ab="intersect/{sample1}{sample2}_{param}.bed"
    shell:
        "bedtools intersect -u -a {input.a} -b {input.b} > {output.ab}"

关于此的一些注意事项:

Just some notes about this:

  • 使用通配符名称(或输入/输出名称)还是样本的真实名称会造成混淆.
  • 连续放置两个或多个不带分隔符的通配符非常危险

所有内容放在一起,我会写这样的东西:

All put together, I would write something like this:

import itertools

sampleCombinations = [s[0]+"-"+s[1] for s in list(itertools.combinations(config["samples"],2))]

rule all:
    input:
        expand("intersect/{pairOfSamples}_{param}.bed", pairOfSamples=sampleCombinations, param=config["threshhold"])

rule intersect_2: # (bedtools intersect)
    input:
        s1="{sample1}_{param}_peaks.narrowPeak",
        s2="{sample2}_{param}_peaks.narrowPeak"
    output:
        "intersect/{sample1}-{sample2}_{param}.bed"
    shell:
        "bedtools intersect -u -a {input.s1} -b {input.s2} > {output}"

itertools.combinations将从配置中定义的样本中配对.

itertools.combinations will make pairs out of your samples defined in config.

sampleCombinations列表将包含所有可能的样本对,并用连字符(例如:"a-b","b-c","c-d"等...)分隔.如果您的样本名称包含连字符,这可能会中断,因为snakemake将无法很好地重建通配符({sample1}-{sample2}).在这种情况下,请选择另一个分隔符.

The sampleCombinations list will hold all possible pairs of samples separated by a hyphen (ie: "a-b", "b-c", "c-d" etc...). This might break if your sample names contain hyphens, as snakemake will not be able to reconstruct the wildcards well ({sample1}-{sample2}). It that case, choose another separator.

编辑,遵循OP的评论:

很抱歉,我忘记了放置intersect文件夹输出.
我也忘记了输入文件中的_{param}_参数

Sorry in my rule all, I had forgotten to put the intersect folder output.
I also forgot the _{param}_ par in the input files

假设您有一个这样的配置文件:

Let's say you have a config file like this:

samples:
- name: very_long_name_a
  shortName: a
- name: very_long_name_b
  shortName: b
- name: very_long_name_c
  shortName: c
threshhold:
- 0.5
- 0.1
- 0.2

然后,您可以使用以下内容:

then, you can use something like this:

import itertools

configfile: "config.yaml"

longNamesList = [s["name"] for s in config["samples"]]
shortNamesList = [s["shortName"] for s in config["samples"]]
shortToLong = {s["shortName"]:s["name"] for s in config["samples"]}

sampleCombinations = [s[0]+"-"+s[1] for s in list(itertools.combinations(shortNamesList,2))]

rule all:
        input: expand("intersect/{pairOfSamples}_{param}.bed", pairOfSamples=sampleCombinations, param=config["threshhold"])


rule intersect_2: # (bedtools intersect)
    input:
        s1=lambda wildcards: shortToLong[wildcards.sample1]+"_"+wildcards.param+"_peaks.narrowPeak",
        s2=lambda wildcards: shortToLong[wildcards.sample2]+"_"+wildcards.param+"_peaks.narrowPeak"
    output:
        "intersect/{sample1}-{sample2}_{param}.bed"
    shell:
        "bedtools intersect -u -a {input.s1} -b {input.s2} > {output}"

这篇关于snakemake,如何为两个独立参数构建循环的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆