Snakemake: confusion on how to access config files properly

Problem description

This question follows on from a question I asked previously and it regards understanding how to access config files correctly using Snakemake. I have a specific problem I need to address which I'll ask first and a general problem understanding how indexing works which I'll ask second.

I'm using snakemake to run an ATAC-seq pipeline from alignment/QC through to motif analysis.

A: Specific problem

I'm trying to add a rule called trim_galore_pe to trim adapters from my fastq files before alignment and an error statement is thrown from snakemake as the names of the output files generated by trim galore do not match what is expected by snakemake. This is because I cannot work out how to write the output file statement correctly in my snakemake file to make the names match.

The names generated by Trim Galore contain SRA numbers, for example:

trimmed_fastq_files/SRR2920475_1_val_1.fq.gz

Whereas the files expected by snakemake contain sample references and should read:

trimmed_fastq_files/Corces2016_4983.7B_Mono_1_val_1.fq.gz

This also affects the subsequent rules after the trim_galore_pe rule. I need to work out a way to use the info in my config file to generate the output files required.

For all rules after those shown in the Snakefile I need files to be named by sample name i.e. Corces2016_4983.7A_Mono. It would also be useful for all the FAST_QC and MULTIQC rules shown in the Snakefile below to have the sample names in the output file name structure, which they all already do in the current Snakefile.

However, inputs for Bowtie2, the FASTQC rules and input and output of the trim_galore_pe rules need to contain the SRA numbers. The problem starts at the output of trim_galore and influences all downstream rules.

Although I have extracted SRA numbers in previous rules, I'm not sure how to do this when not using the fastq_files folder, which is explicitly stated in the config file. By introducing the trim_galore_pe rule I have effectively moved a new set of SRA files into the new trimmed_fastq_files folder. The crux of my issue is how to extract only the SRA number from the config file's list of SRA files, which contains the old folder names, whilst referencing the new trimmed_fastq_files folder in the Snakefile.

I hope this is clear.

Here is my config file:

samples:
    Corces2016_4983.7A_Mono: fastq_files/SRR2920475
    Corces2016_4983.7B_Mono: fastq_files/SRR2920476
cell_types:
    Mono:
    - Corces2016_4983.7A
index: /home/genomes_and_index_files/hg19

Here is my Snakefile:

# read config info into this namespace
configfile: "config.yaml"
print (config['samples'])

rule all:
    input:
        expand("FastQC/PRETRIM/{sample}_{num}_fastqc.zip", sample=config["samples"], num=['1', '2']),
        expand("bam_files/{sample}.bam", sample=config["samples"]),
        "FastQC/PRETRIM/fastq_multiqc.html",
        "FastQC/POSTTRIM/fastq_multiqc.html"

rule fastqc_pretrim:
    input:
        sample=lambda wildcards: f"{config['samples'][wildcards.sample]}_{wildcards.num}.fastq.gz"
    output:
        # Output needs to end in '_fastqc.html' for multiqc to work
        html="FastQC/PRETRIM/{sample}_{num}_fastqc.html",
        zip="FastQC/PRETRIM/{sample}_{num}_fastqc.zip"
    wrapper:
        "0.23.1/bio/fastqc"

rule multiqc_fastq_pretrim:
    input:
        expand("FastQC/PRETRIM/{sample}_{num}_fastqc.html", sample=config["samples"], num=['1', '2'])
    output:
        "FastQC/PRETRIM/fastq_multiqc.html"
    wrapper:
        "0.23.1/bio/multiqc"

rule trim_galore_pe:
    input:
        sample=lambda wildcards: expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
    output:
        "trimmed_fastq_files/{sample}_1_val_1.fq.gz",
        "trimmed_fastq_files/{sample}_1.fastq.gz_trimming_report.txt",
        "trimmed_fastq_files/{sample}_2_val_2.fq.gz",
        "trimmed_fastq_files/{sample}_2.fastq.gz_trimming_report.txt"
    params:
        extra="--illumina -q 20"
    log:
        "logs/trim_galore/{sample}.log"
    wrapper:
        "0.23.1/bio/trim_galore/pe"

rule fastqc_posttrim:
    input:
        "trimmed_fastq_files/{sample}_1_val_1.fq.gz", "trimmed_fastq_files/{sample}_2_val_2.fq.gz"
    output:
        # Output needs to end in '_fastqc.html' for multiqc to work
        html="FastQC/POSTTRIM/{sample}_{num}_fastqc.html",
        zip="FastQC/POSTTRIM/{sample}_{num}_fastqc.zip"
    wrapper:
        "0.23.1/bio/fastqc"

rule multiqc_fastq_posttrim:
    input:
        expand("FastQC/POSTTRIM/{sample}_{num}.trim_fastqc.html", sample=config["samples"], num=['1', '2'])
    output:
        "FastQC/POSTTRIM/fastq_multiqc.html"
    wrapper:
        "0.23.1/bio/multiqc"

rule bowtie2:
    input:
        "trimmed_fastq_files/{sample}_1_val_1.fq.gz", "trimmed_fastq_files/{sample}_2_val_2.fq.gz"
    output:
        "bam_files/{sample}.bam"
    log:
        "logs/bowtie2/{sample}.txt"
    params:
       index=config["index"],  # prefix of reference genome index (built with bowtie2-build),
       extra=""
    threads: 8
    wrapper:
        "0.23.1/bio/bowtie2/align"

This currently runs and gives a full job list using snakemake -np, but throws the error mentioned above.

B: General question

Is there an online resource that explains succinctly how to reference a config file using python, particularly with reference to snakemake? The online docs are pretty insufficient and assume prior knowledge of python.

My programming experience is mainly in bash and R but I like Snakemake and do generally understand how dictionaries and lists work in python and how to reference items stored within them. However I find the complex use of bracketing, wildcards and inverted commas in some of the Snakemake rules above confusing so tend to struggle when trying to reference different parts of file names in the config file. I want to understand fully how to utilise these elements.

For example, in a rule such as this from the Snakefile posted above:

sample=lambda wildcards: expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2]) 

What is actually happening in this command? My understanding is that we are accessing the config file using config['samples'] and we are using the [wildcards.sample] part to explicitly access the fastq_files/SRR2920475 part of the config file. The expand allows us to iterate through each item in the config file that fits the parameters in the command, i.e. all the SRA files, and the lambda wildcards is needed to use the wildcards call in the command. What I'm uncertain about is:

  1. What does the f do just after the expand and why is it needed?
  2. Why does config['samples'] contain inverted commas within the square brackets but inverted commas are not needed around [wildcards.sample]?
  3. Why are the single and double curly brackets used?
  4. Looking at the Snakefile above, a few of the rules contain parts assigning a sequence of numbers to num, but again these numbers are sometimes enclosed in inverted commas and sometimes not... why?

Any advice, tips, pointers would be greatly appreciated.

C: Clarification of suggestions made by @bli below

I have edited my config file as you suggested in your comment and omitted the folder names, leaving only the SRA numbers. This makes sense to me, but I have a couple of other issues preventing me from getting this Snakefile running.

New config file:

samples:
    Corces2016_4983.7A_Mono: SRR2920475
    Corces2016_4983.7B_Mono: SRR2920476
cell_types:
    Mono:
    - Corces2016_4983.7A
index: /home/c1477909/genomes_and_index_files/hg19

New Snakefile:

# read config info into this namespace
configfile: "config.yaml"
print (config['samples'])

rule all:
    input:
        expand("FastQC/PRETRIM/{sample}_{num}_fastqc.zip", sample=config["samples"], num=['1', '2']),
        expand("bam_files/{sample}.bam", sample=config["samples"]),
        "FastQC/PRETRIM/fastq_multiqc.html",
        "FastQC/POSTTRIM/fastq_multiqc.html",

rule fastqc_pretrim:
    input:
      lambda wildcards: f"fastq_files/{config['samples'][wildcards.sample]}_{wildcards.num}.fastq.gz"
    output:
        # Output needs to end in '_fastqc.html' for multiqc to work
        html="FastQC/PRETRIM/{sample}_{num}_fastqc.html",
        zip="FastQC/PRETRIM/{sample}_{num}_fastqc.zip"
    wrapper:
        "0.23.1/bio/fastqc"

rule multiqc_fastq_pretrim:
    input:
        expand("FastQC/PRETRIM/{sample}_{num}_fastqc.html", sample=config["samples"], num=['1', '2'])
    output:
        "FastQC/PRETRIM/fastq_multiqc.html"
    wrapper:
        "0.23.1/bio/multiqc"

rule trim_galore_pe:
    input:
        lambda wildcards: expand(f"fastq_files/{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
    output:
        "trimmed_fastq_files/{wildcards.sample}_1_val_1.fq.gz",
        "trimmed_fastq_files/{wildcards.sample}_1.fastq.gz_trimming_report.txt",
        "trimmed_fastq_files/{wildcards.sample}_2_val_2.fq.gz",
        "trimmed_fastq_files/{wildcards.sample}_2.fastq.gz_trimming_report.txt"
    params:
        extra="--illumina -q 20"
    log:
        "logs/trim_galore/{sample}.log"
    wrapper:
        "0.23.1/bio/trim_galore/pe"

rule fastqc_posttrim:
    input:
        lambda wildcards: expand(f"trimmed_fastq_files/{config['samples'][wildcards.sample]}_{{num}}_val_{{num}}.fq.gz", num=[1,2])
    output:
        # Output needs to end in '_fastqc.html' for multiqc to work
        html="FastQC/POSTTRIM/{sample}_{num}_fastqc.html",
        zip="FastQC/POSTTRIM/{sample}_{num}_fastqc.zip"
    wrapper:
        "0.23.1/bio/fastqc"

rule multiqc_fastq_posttrim:
    input:
        expand("FastQC/POSTTRIM/{sample}_{num}.trim_fastqc.html", sample=config["samples"], num=['1', '2'])
    output:
        "FastQC/POSTTRIM/fastq_multiqc.html"
    wrapper:
        "0.23.1/bio/multiqc"

rule bowtie2:
    input:
        lambda wildcards: expand(f"trimmed_fastq_files/{config['samples'][wildcards.sample]}_{{num}}_val_{{num}}.fq.gz", num=[1,2])
    output:
        "bam_files/{sample}.bam"
    log:
        "logs/bowtie2/{sample}.txt"
    params:
        index=config["index"],  # prefix of reference genome index (built with bowtie2-build),
        extra=""
    threads: 8
    wrapper:
        "0.23.1/bio/bowtie2/align"

Using these new files everything worked fine initially, a partial job list was created by snakemake -np. However, this is because half of the complete job list had already been run; that is the trimmed_fastq_files folder was generated and the correctly named trimmed fastq files were in place within it. When I deleted all the previously created files to see if the entire new version of the Snakefile would work properly, snakemake -np failed, stating that there were missing input files for the rules downstream of the trim_galore_pe rule.

As you can see, I'm trying to call the {wildcards.sample} variable set in the input section of the trim_galore_pe rule in the output section, but snakemake doesn't like this. Is it possible to do this?

I also tried this using the tips from the answers below but this didn't work either:

rule trim_galore_pe:
    input:
        sample=lambda wildcards: expand(f"fastq_files/{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
    output:
        expand(f"trimmed_fastq_files/{config['samples'][wildcards.sample]}_{{num}}_val_{{num}}.fq.gz", num=[1,2]),
        expand(f"trimmed_fastq_files/{config['samples'][wildcards.sample]}_{{num}}.fastq.gz_trimming_report.txt", num=[1,2])
    params:
        extra="--illumina -q 20"
    log:
        "logs/trim_galore/{sample}.log"
    wrapper:
        "0.23.1/bio/trim_galore/pe"

The error then stated "wildcards not defined". So, logically, I tried putting lambda wildcards: in front of the two expand statements of the output section in an attempt to define the wildcards, but this threw a syntax error: "Only input files can be specified as functions". I also tried using some of the indexing suggestions below but couldn't get the right combination.

This is probably caused by another thing I'm unsure about regarding Snakefiles and that is how scoping works.

  • If I define a variable in rule all can all other rules access it?
  • If I define a variable in the input section of a rule, is it available to all other sections of that rule (i.e. output, shell command etc.), but only that rule?
  • If yes, why can't I access the {wildcard.sample} variable if I defined it in the input section? Is that because that variable is enclosed within a 'closed' scope lambda function?

Any (further) suggestions would be greatly appreciated.

Recommended answer

I'll try to answer your question B, and give extra details that I hope can be useful for you and others.

I added some attempts at answering question C at the end.

First, regarding what you call "inverted" commas, they are usually called "single quotes", and they are used in python to build strings. Double quotes can also be used for the same purpose. The main difference is when you try to create strings that contain quotes. Using double quotes allows you to create strings containing single quotes, and vice-versa. Otherwise, you need to "escape" the quote using backslashes ("\"):

s1 = 'Contains "double quotes"'
s1_bis = "Contains \"double quotes\""
s2 = "Contains 'single quotes'"
s2_bis = 'Contains \'single quotes\''

(I tend to prefer double quotes, that's just a personal taste.)

rule trim_galore_pe:
    input:
        sample=lambda wildcards: expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])

You are assigning a function (lambda wildcards: ...) to a variable (sample), which happens to belong to the input section of a rule.

This will cause snakemake to use this function when it comes to determining the input of a particular instance of the rule, based on the current values of the wildcards (as inferred from the current value of the output it wants to generate).

For clarity, one could very likely rewrite this by separating the function definition from the rule declaration, without using the lambda construct, and it would work identically:

def determine_sample(wildcards):
    return expand(
        f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz",
        num=[1,2])

rule trim_galore_pe:
    input:
        sample = determine_sample

expand is a snakemake-specific function (but you can import it in any python program or interactive interpreter with from snakemake.io import expand), that makes it easier to generate lists of strings. In the following interactive python3.6 session we will try to reproduce what happens when you use it, using different native python constructs.

# We'll try to see how `expand` works, we can import it from snakemake
from snakemake.io import expand

# We want to see how it works using the following example
# expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])

# To make the example work, we will first simulate the reading
# of a configuration file
import yaml

config_text = """
samples:
    Corces2016_4983.7A_Mono: fastq_files/SRR2920475
    Corces2016_4983.7B_Mono: fastq_files/SRR2920476
cell_types:
    Mono:
    - Corces2016_4983.7A
index: /home/genomes_and_index_files/hg19
"""
# Here we used triple quotes, to have a readable multi-line string.

# The following is equivalent to what snakemake does with the configuration file
# (safe_load avoids the Loader argument that recent PyYAML versions require for yaml.load):
config = yaml.safe_load(config_text)
config

Output:

{'cell_types': {'Mono': ['Corces2016_4983.7A']},
 'index': '/home/genomes_and_index_files/hg19',
 'samples': {'Corces2016_4983.7A_Mono': 'fastq_files/SRR2920475',
  'Corces2016_4983.7B_Mono': 'fastq_files/SRR2920476'}}

We obtained a dictionary in which the key "samples" is associated with a nested dictionary.

# We can access the nested dictionary as follows
config["samples"]
# Note that single quotes could be used instead of double quotes
# Python interactive interpreter uses single quotes when it displays strings

Output:

{'Corces2016_4983.7A_Mono': 'fastq_files/SRR2920475',
 'Corces2016_4983.7B_Mono': 'fastq_files/SRR2920476'}

# We can access the value corresponding to one of the keys
# again using square brackets
config["samples"]["Corces2016_4983.7A_Mono"]

Output:

'fastq_files/SRR2920475'

# Now, we will simulate a `wildcards` object that has a `sample` attribute
# We'll use a namedtuple for that
# https://docs.python.org/3/library/collections.html#collections.namedtuple
from collections import namedtuple
Wildcards = namedtuple("Wildcards", ["sample"])
wildcards = Wildcards(sample="Corces2016_4983.7A_Mono")
wildcards.sample

Output:

'Corces2016_4983.7A_Mono'


Edit (15/11/2018): I found out a better way of creating wildcards:

from snakemake.io import Wildcards
wildcards = Wildcards(fromdict={"sample": "Corces2016_4983.7A_Mono"})


# We can use this attribute as a key in the nested dictionary
# instead of using directly the string
config["samples"][wildcards.sample]
# No quotes here: `wildcards.sample` is a string variable

Output:

'fastq_files/SRR2920475'

Deconstructing expand

# Now, the expand of the example works, and it results in a list with two strings
expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])    
# Note: here, single quotes are used for the string "sample",
# in order not to close the opening double quote of the whole string

Output:

['fastq_files/SRR2920475_1.fastq.gz', 'fastq_files/SRR2920475_2.fastq.gz']

# Internally, I think what happens is something similar to the following:
filename_template = f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz"

# This template is then used for each element of this "list comprehension"    
[filename_template.format(num=num) for num in [1, 2]]

Output:

['fastq_files/SRR2920475_1.fastq.gz', 'fastq_files/SRR2920475_2.fastq.gz']

# This is equivalent to building the list using a for loop:
filenames = []
for num in [1, 2]:
    filename = filename_template.format(num=num)
    filenames.append(filename)
filenames

Output:

['fastq_files/SRR2920475_1.fastq.gz', 'fastq_files/SRR2920475_2.fastq.gz']
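
When several keyword arguments are given, as in the rule all of the question, expand combines every value of one with every value of the others. Here is a minimal pure-Python sketch of that behaviour. It assumes itertools.product is a fair stand-in for what expand does by default, and it is not snakemake's actual implementation:

```python
from itertools import product

# Same structure as the `samples` part of the config file
config = {"samples": {
    "Corces2016_4983.7A_Mono": "fastq_files/SRR2920475",
    "Corces2016_4983.7B_Mono": "fastq_files/SRR2920476"}}

template = "FastQC/PRETRIM/{sample}_{num}_fastqc.zip"

# Iterating over a dictionary yields its keys, i.e. the sample names,
# which is why `sample=config["samples"]` gives names and not SRA paths
targets = [
    template.format(sample=sample, num=num)
    for sample, num in product(config["samples"], ["1", "2"])]

for target in targets:
    print(target)
```

This gives one target per (sample, num) pair, named after the sample keys rather than the SRA values.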

String templates and formatting

# It is interesting to have a look at `filename_template`    
filename_template

Output:

'fastq_files/SRR2920475_{num}.fastq.gz'

# The part between curly braces can be substituted
# during a string formatting operation:
"fastq_files/SRR2920475_{num}.fastq.gz".format(num=1)

Output:

'fastq_files/SRR2920475_1.fastq.gz'

Now let's further show how string formatting can be used.

# In python 3.6 and above, one can create formatted strings    
# in which the values of variables are interpreted inside the string    
# if the string is prefixed with `f`.
# That's what happens when we create `filename_template`:
filename_template = f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz"    
filename_template    

Output:

'fastq_files/SRR2920475_{num}.fastq.gz'

Two substitutions happened during the formatting of the string:

  1. The value of config['samples'][wildcards.sample] was used to make the first part of the string. (Single quotes were used around samples because this Python expression was inside a string built with double quotes.)
  2. The double curly braces around num were reduced to single ones as part of the formatting operation. That's why we can then use this again in further formatting operations involving num.

# Equivalently, without using 3.6 syntax:    
filename_template = "{filename_prefix}_{{num}}.fastq.gz".format(
    filename_prefix = config["samples"][wildcards.sample])
filename_template

Output:

'fastq_files/SRR2920475_{num}.fastq.gz'

# We could achieve the same by first extracting the value
# from the `config` dictionary    
filename_prefix = config["samples"][wildcards.sample]
filename_template = f"{filename_prefix}_{{num}}.fastq.gz"
filename_template

Output:

'fastq_files/SRR2920475_{num}.fastq.gz'

# Or, equivalently:
filename_prefix = config["samples"][wildcards.sample]
filename_template = "{filename_prefix}_{{num}}.fastq.gz".format(
    filename_prefix=filename_prefix)
filename_template

Output:

'fastq_files/SRR2920475_{num}.fastq.gz'

# We can actually perform string formatting on several variables
# at the same time:
filename_prefix = config["samples"][wildcards.sample]
num = 1
"{filename_prefix}_{num}.fastq.gz".format(
    filename_prefix=filename_prefix, num=num)

Output:

'fastq_files/SRR2920475_1.fastq.gz'

# Or, using 3.6 formatted strings
filename_prefix = config["samples"][wildcards.sample]
num = 1
f"{filename_prefix}_{num}.fastq.gz"

Output:

'fastq_files/SRR2920475_1.fastq.gz'

# We could therefore build the result of the expand in a single step:
[f"{config['samples'][wildcards.sample]}_{num}.fastq.gz" for num in [1, 2]]

Output:

['fastq_files/SRR2920475_1.fastq.gz', 'fastq_files/SRR2920475_2.fastq.gz']
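
This also bears on point 4 of the question (numbers given to num sometimes quoted, sometimes not). As far as plain string formatting goes, an integer and the corresponding string end up as the same text in the file name, so both spellings give the same result. A small sketch using only str.format (the template below is just the one from the example above):

```python
template = "fastq_files/SRR2920475_{num}.fastq.gz"

# num given as integers, as in the trim_galore_pe rule
with_ints = [template.format(num=num) for num in [1, 2]]

# num given as strings, as in rule all
with_strs = [template.format(num=num) for num in ["1", "2"]]

# Both lists contain exactly the same file names
print(with_ints == with_strs)
```

Using one style consistently is probably easier to read, but it should not change the file names that are built.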




Comments about question C

The following is a bit complex, in terms of how Python will build the string:

input:
    lambda wildcards: expand(f"fastq_files/{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])

But it should work, as we can see in the following simulation:

from collections import namedtuple
from snakemake.io import expand

Wildcards = namedtuple("Wildcards", ["sample"])
wildcards = Wildcards(sample="Corces2016_4983.7A_Mono")
config = {"samples": {
    "Corces2016_4983.7A_Mono": "SRR2920475",
    "Corces2016_4983.7B_Mono": "SRR2920476"}}
expand(
    f"fastq_files/{config['samples'][wildcards.sample]}_{{num}}.fastq.gz",
    num=[1,2])

Output:

['fastq_files/SRR2920475_1.fastq.gz', 'fastq_files/SRR2920475_2.fastq.gz']

The problem in the trim_galore_pe rule is actually in its output section: You shouldn't use {wildcards.sample} there, but just {sample}.

The output section of a rule is where you inform snakemake of what the wildcards attributes will be for a given instance of the rule, by matching the file it wants to obtain with the patterns given. The parts matching the curly braces will be used to set the values of the corresponding attribute name.

For instance, if snakemake wants a file called "trimmed_fastq_files/Corces2016_4983.7A_Mono_1_val_1.fq.gz", it will try to match this against all patterns present in all rules' output sections and eventually find this one: "trimmed_fastq_files/{sample}_1_val_1.fq.gz"

Luckily, it will be able to match the filename with the pattern by establishing a correspondence between Corces2016_4983.7A_Mono and the {sample} part. It will then put a sample attribute in the local wildcards instance, a bit as if I were manually doing the following:

Wildcards = namedtuple("Wildcards", ["sample"])
wildcards = Wildcards(sample="Corces2016_4983.7A_Mono")
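
To make the matching step more concrete, here is a rough sketch of how a requested file name could be compared with an output pattern to recover wildcard values. The helper match_output is hypothetical and only imitates the idea with a regular expression (each {name} placeholder is assumed to match any non-empty text); it is not snakemake's real code:

```python
import re

def match_output(pattern, filename):
    """Return a dict of wildcard values if `filename` fits `pattern`, else None."""
    # Splitting on {name} gives literal text at even indices
    # and placeholder names at odd indices
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)      # literal text must match exactly
        else:
            regex += f"(?P<{part}>.+)"    # placeholder captures a value
    found = re.fullmatch(regex, filename)
    return found.groupdict() if found else None

pattern = "trimmed_fastq_files/{sample}_1_val_1.fq.gz"
wanted = "trimmed_fastq_files/Corces2016_4983.7A_Mono_1_val_1.fq.gz"
print(match_output(pattern, wanted))
```

A file name that does not fit the pattern gives None, which corresponds to snakemake moving on to the output patterns of other rules.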

I don't know what happens exactly in snakemake if you use {wildcards.sample} instead of {sample}, but let's try with my simulation framework:

Wildcards = namedtuple("Wildcards", ["sample"])
wildcards = Wildcards(wildcards.sample="Corces2016_4983.7A_Mono")
  File "<ipython-input-12-c02ce12bff85>", line 1
    wildcards = Wildcards(wildcards.sample="Corces2016_4983.7A_Mono")
                         ^
SyntaxError: keyword can't be an expression

What about the following attempt?

output:
    expand(f"trimmed_fastq_files/{config['samples'][wildcards.sample]}_{{num}}_val_{{num}}.fq.gz", num=[1,2]),

Here, my understanding is that Python first tries to apply the f-string formatting to f"trimmed_fastq_files/{config['samples'][wildcards.sample]}_{{num}}_val_{{num}}.fq.gz". To do that, it needs to evaluate config['samples'][wildcards.sample], but the wildcards object does not exist yet, hence the "wildcards not defined" error. wildcards would be generated only after matching the name of a file needed by a "downstream" rule with a string containing {attribute_name} patterns. But that string is exactly what snakemake is currently trying to build.
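
The difference between evaluating the f-string immediately and deferring it inside a function can be reproduced in plain Python. In this sketch, SimpleNamespace is only a stand-in for the object snakemake would pass to an input function, and build_eagerly and deferred are names made up for the illustration:

```python
from types import SimpleNamespace

config = {"samples": {"Corces2016_4983.7A_Mono": "SRR2920475"}}

def build_eagerly():
    # The f-string is evaluated right away, but no `wildcards` name
    # exists at this point, as in the output section of a rule
    return f"trimmed_fastq_files/{config['samples'][wildcards.sample]}_1_val_1.fq.gz"

try:
    build_eagerly()
    eager_error = None
except NameError as err:
    eager_error = err
print(repr(eager_error))

# Wrapped in a function, the f-string is only evaluated when the function
# is called with a wildcards object, as happens for input functions
deferred = lambda wildcards: (
    f"trimmed_fastq_files/{config['samples'][wildcards.sample]}_1_val_1.fq.gz")

print(deferred(SimpleNamespace(sample="Corces2016_4983.7A_Mono")))
```

The deferred form works for input sections because snakemake calls the function once the wildcards are known; output sections are where the wildcards come from, so they have to stay plain patterns.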

Here are some important points to remember:

  • wildcards actually exist only locally, in an instance of a rule, after having matched its output with a file required by another "downstream" rule instance.
  • You don't define variables in input sections. You use variables to build the concrete names of the files that the rule instance will need (or, more precisely, that you say you want to exist before the rule instance can be run: the rule does not need to actually use those files). Those variables are those defined outside the scope of the rules, directly at the ground level of the snakefile, in pure Python mode, and the local wildcards object. By default, the {attribute_name} placeholders will be substituted by the attributes of the local wildcards object ("{sample}" becomes "Corces2016_4983.7A_Mono"), but if you want to do more complicated stuff to build the file names, you need to do this via a function that will have to explicitly handle this wildcards object (lambda wildcards: f"{wildcards.sample}" becomes "Corces2016_4983.7A_Mono").
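
As an illustration of the last point, here is a sketch of a function defined at the top level of the snakefile that explicitly handles the wildcards object to turn a sample name into SRA-based file names. The name trimmed_fastqs is hypothetical, the config is the one from part C of the question, and SimpleNamespace again stands in for snakemake's wildcards:

```python
from types import SimpleNamespace

config = {"samples": {
    "Corces2016_4983.7A_Mono": "SRR2920475",
    "Corces2016_4983.7B_Mono": "SRR2920476"}}

def trimmed_fastqs(wildcards):
    """Hypothetical input function: sample name -> trimmed, SRA-named files."""
    sra = config["samples"][wildcards.sample]
    return [f"trimmed_fastq_files/{sra}_{num}_val_{num}.fq.gz" for num in [1, 2]]

# Stand-in for the object snakemake would build after matching an output
wildcards = SimpleNamespace(sample="Corces2016_4983.7B_Mono")
print(trimmed_fastqs(wildcards))
```

In a rule, such a function would be given by name in the input section (as with determine_sample earlier), without calling it. This only covers building input names; the output section still has to be written with plain {sample}-style patterns.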
