How to use expand in snakemake when some particular combinations of wildcards are not desired?
Question
Suppose I have the following files, which I want to process automatically using snakemake:
test_input_C_1.txt
test_input_B_2.txt
test_input_A_2.txt
test_input_A_1.txt
The following Snakefile uses expand to determine all the potential final result files:
rule all:
    input: expand("test_output_{text}_{num}.txt", text=["A", "B", "C"], num=[1, 2])

rule make_output:
    input: "test_input_{text}_{num}.txt"
    output: "test_output_{text}_{num}.txt"
    shell:
        """
        md5sum {input} > {output}
        """
Executing the above Snakefile results in the following error:
MissingInputException in line 4 of /tmp/Snakefile:
Missing input files for rule make_output:
test_input_B_1.txt
The reason for that error is that expand uses itertools.product under the hood to generate the wildcard combinations, and some of those combinations correspond to input files that do not exist.
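To make the cause concrete, here is a minimal plain-Python sketch (no snakemake involved) that builds the same Cartesian product and compares it with the four input files listed above. The variable names `existing`, `requested` and `missing` are illustrative only.

```python
from itertools import product

texts = ["A", "B", "C"]
nums = [1, 2]

# Input files that actually exist (copied from the listing above)
existing = {
    "test_input_C_1.txt",
    "test_input_B_2.txt",
    "test_input_A_2.txt",
    "test_input_A_1.txt",
}

# product yields every (text, num) pair: 3 * 2 = 6 combinations
requested = [
    f"test_input_{text}_{num}.txt" for text, num in product(texts, nums)
]

# Combinations that were requested but have no matching input file
missing = [name for name in requested if name not in existing]
print(missing)  # ['test_input_B_1.txt', 'test_input_C_2.txt']
```

Snakemake stops at the first missing input it finds, which is why only test_input_B_1.txt shows up in the error message.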
How can I filter out the undesired wildcard combinations?
Recommended answer
The expand function accepts an optional second positional argument: a function to use in place of the default one for combining wildcard values.
One can create a filtered version of itertools.product by wrapping it in a higher-order generator that yields a combination of wildcards only if it is not in a pre-established blacklist:
from itertools import product

def filter_combinator(combinator, blacklist):
    def filtered_combinator(*args, **kwargs):
        for wc_comb in combinator(*args, **kwargs):
            # Use frozenset instead of tuple
            # in order to accommodate
            # unpredictable wildcard order
            if frozenset(wc_comb) not in blacklist:
                yield wc_comb
    return filtered_combinator

# "B_1" and "C_2" are undesired
forbidden = {
    frozenset({("text", "B"), ("num", 1)}),
    frozenset({("text", "C"), ("num", 2)})}

filtered_product = filter_combinator(product, forbidden)

rule all:
    input:
        # Override default combination generator
        expand("test_output_{text}_{num}.txt", filtered_product, text=["A", "B", "C"], num=[1, 2])

rule make_output:
    input: "test_input_{text}_{num}.txt"
    output: "test_output_{text}_{num}.txt"
    shell:
        """
        md5sum {input} > {output}
        """
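The filter can be checked outside snakemake with the sketch below. It assumes that expand calls the combinator with one list of (name, value) pairs per wildcard, which is what the frozenset comparison in the answer relies on. The names `wildcards`, `pair_lists` and `targets` are hypothetical stand-ins and are not part of snakemake.

```python
from itertools import product

def filter_combinator(combinator, blacklist):
    def filtered_combinator(*args, **kwargs):
        for wc_comb in combinator(*args, **kwargs):
            # Each wc_comb is a tuple of (name, value) pairs
            if frozenset(wc_comb) not in blacklist:
                yield wc_comb
    return filtered_combinator

forbidden = {
    frozenset({("text", "B"), ("num", 1)}),
    frozenset({("text", "C"), ("num", 2)}),
}
filtered_product = filter_combinator(product, forbidden)

# Assumed calling convention: one list of (name, value) pairs per wildcard
wildcards = {"text": ["A", "B", "C"], "num": [1, 2]}
pair_lists = [
    [(name, value) for value in values] for name, values in wildcards.items()
]

# Fill the pattern with each combination that survives the filter
pattern = "test_output_{text}_{num}.txt"
targets = [
    pattern.format(**dict(comb)) for comb in filtered_product(*pair_lists)
]
print(targets)
```

Only the four combinations that have a matching input file remain, so rule all no longer requests test_output_B_1.txt or test_output_C_2.txt.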
The missing wildcard combinations can be read from the configfile.
Here is an example in JSON format:
{
    "missing" :
    [
        {
            "text" : "B",
            "num" : 1
        },
        {
            "text" : "C",
            "num" : 2
        }
    ]
}
The forbidden set would then be built as follows in the Snakefile:
forbidden = {frozenset(wc_comb.items()) for wc_comb in config["missing"]}
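As a quick check of that one-liner, the sketch below parses the same JSON with the standard json module and builds the set. Here `config_text` and the local `config` dictionary are stand-ins for what snakemake would provide after loading the configfile.

```python
import json

# Stand-in for the content of the JSON configfile shown above
config_text = """
{
    "missing": [
        {"text": "B", "num": 1},
        {"text": "C", "num": 2}
    ]
}
"""
config = json.loads(config_text)

# Same expression as in the Snakefile
forbidden = {frozenset(wc_comb.items()) for wc_comb in config["missing"]}

# JSON numbers are parsed as Python ints, so ("num", 1) is in the set
print(frozenset({("text", "B"), ("num", 1)}) in forbidden)    # True
# A string value does not match an int value
print(frozenset({("text", "B"), ("num", "1")}) in forbidden)  # False
```

One thing to watch: the value types in the configfile have to match those passed to expand. With num=[1, 2] the JSON must contain the numbers 1 and 2 and not the strings "1" and "2", otherwise nothing would be filtered out.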