How to use expand in snakemake when some particular combinations of wildcards are not desired?


Problem description

Suppose I have the following files, which I want to process automatically using snakemake:

test_input_C_1.txt
test_input_B_2.txt
test_input_A_2.txt
test_input_A_1.txt

The following Snakefile uses expand to determine all the potential final result files:

rule all:
    input: expand("test_output_{text}_{num}.txt", text=["A", "B", "C"], num=[1, 2])

rule make_output:
    input: "test_input_{text}_{num}.txt"
    output: "test_output_{text}_{num}.txt"
    shell:
        """
        md5sum {input} > {output}
        """

Executing the above Snakefile results in the following error:

MissingInputException in line 4 of /tmp/Snakefile:
Missing input files for rule make_output:
test_input_B_1.txt

The reason for this error is that expand uses itertools.product under the hood to generate the wildcard combinations, some of which happen to correspond to missing files.
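To make this concrete, here is a minimal plain-Python sketch, run outside of snakemake, that reproduces the Cartesian product and lists the pairs with no matching input file. The names existing, all_combinations and missing are only illustrative.

```python
from itertools import product

texts = ["A", "B", "C"]
nums = [1, 2]

# (text, num) pairs for which an input file exists,
# taken from the file listing in the question
existing = {("C", 1), ("B", 2), ("A", 2), ("A", 1)}

# Every pair generated by the Cartesian product
all_combinations = list(product(texts, nums))

# Pairs that have no corresponding input file
missing = [comb for comb in all_combinations if comb not in existing]

print(all_combinations)
print(missing)
```

The two pairs reported as missing are the ones that have to be excluded from the targets of rule all.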

How can I filter out the undesired wildcard combinations?

Recommended answer

The expand function accepts an optional second positional argument: a function to use instead of the default one for combining wildcard values.
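As a rough illustration of what such a combining function does, the sketch below assumes that the combinator is called with one list of (name, value) pairs per wildcard. That assumption is consistent with the frozenset-based filtering used in this answer, but it is not a documented guarantee. The helper combine is hypothetical and is not part of snakemake.

```python
from itertools import product

# Assumption: the combinator receives one list of
# (wildcard name, value) pairs per wildcard
text_pairs = [("text", "A"), ("text", "B")]
num_pairs = [("num", 1), ("num", 2)]

def combine(combinator):
    """Return the wildcard dicts that a given combinator would produce."""
    return [dict(comb) for comb in combinator(text_pairs, num_pairs)]

by_product = combine(product)  # every text paired with every num
by_zip = combine(zip)          # values paired position by position

print(by_product)
print(by_zip)
```

Any callable with the same calling convention as product or zip can therefore be passed as the combinator, including a wrapper that drops some combinations.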

One can create a filtered version of itertools.product by wrapping it in a higher-order function that returns a generator, which only yields the wildcard combinations that are not in a pre-established blacklist:

from itertools import product

def filter_combinator(combinator, blacklist):
    def filtered_combinator(*args, **kwargs):
        for wc_comb in combinator(*args, **kwargs):
            # Use frozenset instead of tuple
            # in order to accommodate
            # unpredictable wildcard order
            if frozenset(wc_comb) not in blacklist:
                yield wc_comb
    return filtered_combinator

# "B_1" and "C_2" are undesired
forbidden = {
    frozenset({("text", "B"), ("num", 1)}),
    frozenset({("text", "C"), ("num", 2)})}

filtered_product = filter_combinator(product, forbidden)

rule all:
    input:
        # Override default combination generator
        expand("test_output_{text}_{num}.txt", filtered_product, text=["A", "B", "C"], num=[1, 2])

rule make_output:
    input: "test_input_{text}_{num}.txt"
    output: "test_output_{text}_{num}.txt"
    shell:
        """
        md5sum {input} > {output}
        """


The missing wildcard combinations can be read from the configfile.

Here is an example in JSON format:

{
    "missing" :
    [
        {
            "text" : "B",
            "num" : 1
        },
        {
            "text" : "C",
            "num" : 2
        }
    ]
}

The forbidden set would then be read as follows in the Snakefile:

forbidden = {frozenset(wc_comb.items()) for wc_comb in config["missing"]}
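The whole approach can be checked outside of snakemake with a small plain-Python sketch. Here the config dict is parsed from an inline JSON string rather than loaded by snakemake, and expand_sketch is a hypothetical stand-in that only mimics the assumed way expand calls its combinator. It is not the real implementation.

```python
import json
from itertools import product

# Stand-in for the config dict that snakemake would build from the configfile
config = json.loads(
    '{"missing": [{"text": "B", "num": 1}, {"text": "C", "num": 2}]}'
)

forbidden = {frozenset(wc_comb.items()) for wc_comb in config["missing"]}

def filter_combinator(combinator, blacklist):
    def filtered_combinator(*args, **kwargs):
        for wc_comb in combinator(*args, **kwargs):
            # Skip combinations listed in the blacklist
            if frozenset(wc_comb) not in blacklist:
                yield wc_comb
    return filtered_combinator

filtered_product = filter_combinator(product, forbidden)

def expand_sketch(pattern, combinator, **wildcards):
    """Hypothetical simplification of expand, for illustration only."""
    # Assumption: one list of (name, value) pairs per wildcard
    pair_lists = [
        [(name, value) for value in values]
        for name, values in wildcards.items()
    ]
    return [pattern.format(**dict(comb)) for comb in combinator(*pair_lists)]

outputs = expand_sketch(
    "test_output_{text}_{num}.txt",
    filtered_product,
    text=["A", "B", "C"],
    num=[1, 2],
)
print(outputs)
```

Note that the values in the config must have the same type as the values passed to expand: a "num" stored as the string "1" in the JSON file would not match the integer 1 and the combination would not be filtered out.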
