Using multiple filenames as wildcards in Snakemake

Problem description

I am trying to write a Snakemake rule that runs bedtools closest on one file against each of a bunch of files in another directory.

What I have is 20 BED files under the /home/bedfiles directory:

1A.bed , 2B_83.bed , 3f_33.bed ...

What I want is 20 modified BED files under the same /home/bedfiles directory:

1A_modified,  2B_83_modified , 3f_33_modified ...

So the bash command would be:

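filelist='/home/bedfiles/*.bed'
# $filelist is left unquoted on purpose so the shell expands the glob
for mfile in $filelist;
do
    # strip the .bed suffix so that 1A.bed becomes 1A_modified
    bedtools closest -a /home/other/merged.txt -b "${mfile}" > "${mfile%.bed}_modified"
done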

So this command would create files with a _modified suffix in the /home/bedfiles directory.

I want to implement this with Snakemake, but I keep getting a syntax error that I have no idea how to fix. My attempt is:

Step 1: Get the first part of each BED file name in the directory

FIRSTPART = [f.split(".")[0] for f in os.listdir("/home/bedfiles") if f.endswith('.bed')]

Step 2: Define the output names and folder

MODIFIED = expand("/home/bedfiles/{first}_modified", first=FIRSTPART)

Step 3: Write this in rule all:

rule all:
   input: MODIFIED

Step 4: Write a specific rule to run bedtools closest

rule closest:

    input:
        input1 = "/home/other/merged.txt" , \
        input2 = expand("/home/bedfiles/{first}.bed", first=FIRSTPART) 

    output:
        expand("/home/bedfiles/{first}_modified", first=FIRSTPART)  

    shell:
        """ bedtools closest -a {input.input1} -b {input.input2} > {output} """

And it throws this error at the input line of rule all:

invalid syntax

Do you know how to get past this error, or any other way to implement it?

PS: Writing the names of the files one by one is not possible.

Recommended answer

Remove the calls to expand from the input and output definitions in closest. You're currently passing a list of 20 filenames as input.input2 and a list of 20 filenames as output.

That is, your rule closest is currently trying to run once and create 20 files, whereas it should run 20 times and create a single file each time.
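
To see why expand changes the meaning of the rule, here is a minimal plain-Python sketch. It does not call Snakemake: expand_first is a hypothetical stand-in for expand(..., first=FIRSTPART), and the three sample names are borrowed from the question purely as an illustration.

```python
# Made-up sample list; the real FIRSTPART comes from listing /home/bedfiles.
FIRSTPART = ["1A", "2B_83", "3f_33"]


def expand_first(pattern, values):
    """Hypothetical helper mimicking expand(pattern, first=values)."""
    return [pattern.format(first=v) for v in values]


all_outputs = expand_first("/home/bedfiles/{first}_modified", FIRSTPART)

# With expand() inside the rule, a single job owns the whole list of outputs.
jobs_with_expand = [all_outputs]

# With a wildcard in the rule, every requested target gets its own job.
jobs_with_wildcard = [[target] for target in all_outputs]

print(len(jobs_with_expand), "job producing", len(jobs_with_expand[0]), "files")
print(len(jobs_with_wildcard), "jobs producing 1 file each")
```

Keeping expand in rule all (where the full list of targets is wanted) and using a wildcard in closest gives the one-file-per-job behaviour described above.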

In closest, you want input.input2 to be a single file and output to be a single file each time the rule is run:

import os

FIRSTPART = [f.split(".")[0] for f in os.listdir("/home/bedfiles") if f.endswith('.bed')]

print("These are the input files:")
print([f + ".bed" for f in FIRSTPART])

MODIFIED = expand("/home/bedfiles/{first}_modified", first=FIRSTPART)
print("These will be created")
print(MODIFIED)

rule all:
   input: MODIFIED

rule closest:
    message: """
        Converts /home/other/merged.txt and /some/dir/xyz.bed
        into /some/dir/xyz_modified
        """

    input:
        input1 = "/home/other/merged.txt",
        input2 = "{prefix}.bed" 

    output:    "{prefix}_modified"  

    shell:
        """ 
        bedtools closest -a {input.input1} -b {input.input2} > {output}
        """


Here's an experiment:

Move into a temporary directory and, within that directory, do the following:

mkdir bedfiles                                                                  
touch bedfiles/{a,b,c,d}.bed

Then add a file called Snakefile to your current directory containing the following code:

import os                                                                         
import os.path
import re

input_dir = "bedfiles"
input_files = [os.path.join(input_dir, f) for f in os.listdir(input_dir)]

print(input_files)                                                                

# raw string with an escaped dot so that only a literal ".bed" suffix is replaced
output_files = [re.sub(r"\.bed$", "_modified", f) for f in input_files]

print(output_files)                                                               

rule all:                                                                         
    input: output_files                                                           

rule mover:                                                                       
    input: "{prefix}.bed"                                                         
    output: "{prefix}_modified"                                                   
    shell:                                                                        
       """ cp {input} {output} """

Then run it using snakemake at the command line. Snakemake is goal-oriented; it works out how to make your desired outputs based on the existing files.
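
The goal-oriented idea can be illustrated with a small plain-Python sketch that rebuilds the experiment in a throwaway directory. plan_jobs is a hypothetical helper written for this illustration, not a Snakemake function, and it only checks whether files exist; real Snakemake also considers modification times and other details.

```python
import os
import tempfile


def plan_jobs(targets):
    """List (input, output) pairs for targets that are missing but buildable."""
    jobs = []
    for target in targets:
        # Derive the input the same way the mover rule does: swap the suffix.
        source = target[: -len("_modified")] + ".bed"
        if not os.path.exists(target) and os.path.exists(source):
            jobs.append((source, target))
    return jobs


with tempfile.TemporaryDirectory() as tmp:
    bed_dir = os.path.join(tmp, "bedfiles")
    os.mkdir(bed_dir)
    for name in "abcd":
        open(os.path.join(bed_dir, name + ".bed"), "w").close()

    targets = [os.path.join(bed_dir, name + "_modified") for name in "abcd"]
    first_plan = plan_jobs(targets)

    # Pretend one output has already been produced; it drops out of the plan.
    open(targets[0], "w").close()
    second_plan = plan_jobs(targets)

print(len(first_plan), "jobs planned on the first run")
print(len(second_plan), "jobs planned once a_modified exists")
```

The planning starts from the requested outputs and works backwards to the inputs, which is why rule all only needs to list the final files.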
