接受略有不同的Snakemaker规则输入(.fq与.fq.gz) [英] Accepting slightly different inputs to snakemake rule (.fq vs .fq.gz)

查看:18
本文介绍了接受略有不同的Snakemaker规则输入(.fq与.fq.gz)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我是Snakemake的新手,我希望能够获取一对.fq文件或一对.fq.gz文件,并通过trim_galore运行它们来获得一对经过修剪的.fq.gz输出文件。在不给出我所有的Snakefile的情况下,我有了下面这个难看的解决方案,我只是复制了规则并更改了输入。什么是更好的解决方案?

#Trim galore paired end trimming rule for unzipped fastqs:
rule trim_galore_unzipped_PE:
    input:
        r1=join(config['fq_in_path'], '{sample}1.fq'),
        r2=join(config['fq_in_path'], '{sample}2.fq'),
    output:
        r1=join(config['trim_out_path'], '{sample}1_val_1.fq.gz'),
        r2=join(config['trim_out_path'], '{sample}2_val_2.fq.gz'),
    params:
        out_path=config['trim_out_path'],
    conda:
        'envs/biotools.yaml',
    shell:
        'trim_galore --gzip -o {params.out_path} --paired {input.r1} {input.r2}'

#Trim galore paired end trimming rule for gzipped fastqs:
rule trim_galore_zipped_PE:
    input:
        r1=join(config['fq_in_path'], '{sample}1.fq.gz'),
        r2=join(config['fq_in_path'], '{sample}2.fq.gz'),
    output:
        r1=join(config['trim_out_path'], '{sample}1_val_1.fq.gz'),
        r2=join(config['trim_out_path'], '{sample}2_val_2.fq.gz'),
    params:
        out_path=config['trim_out_path'],
    conda:
        'envs/biotools.yaml',
    shell: 
        'trim_galore --gzip -o {params.out_path} --paired {input.r1} {input.r2}'

推荐答案

使用输入函数可能是最佳解决方案,如下所示:

  1. 将通配符传递给输入函数
  2. 使用已知的YAML值,使用该示例名称构建理论文件名。
  3. 使用python函数检查哪个文件(从技术上讲是文件后缀)有效
  4. 生成有效文件列表
  5. 返回并解压缩有效文件列表。

备注:

  • 输入和输出应具有相同的通配符,否则会导致问题
  • 在输入函数中,确保它不能返回空字符串,因为Snakemake将其解释为"缺少输入"要求,而这不是您想要的。
  • 如果您采纳这些建议,请更新规则名称,我忘记了。

Snakefile:

 configfile: "config.yaml"

 from os.path import join
 from os.path import exists

 rule all:
     input:
         expand("{trim_out_path}/{sample}.{readDirection}.fq.gz",
             trim_out_path=config["trim_out_path"],
             sample=config["sampleList"],
             readDirection=['1','2'])


 def trim_galore_input_determination(wildcards):
     potential_file_path_list = []
     # Cycle through both suffix possibilities:
     for fastqSuffix in [".fq", ".fq.gz"]:

         # Cycle through both read directions
         for readDirection in ['.1','.2']:

             #Build the list for ech suffix
             potential_file_path = config["fq_in_path"] + "/" + wildcards.sample + readDirection + fastqSuffix

             #Check if this file actually exists
             if exists(potential_file_path):

                 #If file is legit, add to list of acceptable files
                 potential_file_path_list.append(potential_file_path)

     # Checking for an empty list
     if len(potential_file_path_list):
         return potential_file_path_list
     else:
         return ["trim_galore_input_determination_FAILURE" + wildcards.sample]

 rule trim_galore_unzipped_PE:
     input:
         unpack(trim_galore_input_determination)
     output:
         expand("{trim_out_path}/{{sample}}.{readDirection}.fq.gz",
             trim_out_path=config["trim_out_path"],
             readDirection=['1','2'])
     params:
         out_path=config['trim_out_path'],
     conda:
         'envs/biotools.yaml',
     shell:
         'trim_galore --gzip -o {params.out_path} --paired {input}'

config.yaml:

fq_in_path: input/fq
trim_out_path: output
sampleList: ["mySample1", "mySample2"]

$tree:

|-- [tboyarsk      1540 Sep  6 15:17]  Snakefile
|-- [tboyarsk        82 Sep  6 15:17]  config.yaml
|-- [tboyarsk       512 Sep  6  8:55]  input
|   |-- [tboyarsk       512 Sep  6  8:33]  fq
|   |   |-- [tboyarsk         0 Sep  6  7:50]  mySample1.1.fq
|   |   |-- [tboyarsk         0 Sep  6  8:24]  mySample1.2.fq
|   |   |-- [tboyarsk         0 Sep  6  7:50]  mySample2.1.fq
|   |   `-- [tboyarsk         0 Sep  6  8:24]  mySample2.2.fq
|   `-- [tboyarsk       512 Sep  6  8:55]  fqgz
|       |-- [tboyarsk         0 Sep  6  7:50]  mySample1.1.fq.gz
|       |-- [tboyarsk         0 Sep  6  8:32]  mySample1.2.fq.gz
|       |-- [tboyarsk         0 Sep  6  8:33]  mySample2.1.fq.gz
|       `-- [tboyarsk         0 Sep  6  8:32]  mySample2.2.fq.gz
`-- [tboyarsk       512 Sep  6  7:55]  output

$Snakemake-Dry(输入:fg)

 rule trim_galore_unzipped_PE:
     input: input/fq/mySample1.1.fq, input/fq/mySample1.2.fq
     output: output/mySample1.1.fq.gz, output/mySample1.2.fq.gz
     jobid: 1
     wildcards: sample=mySample1


 rule trim_galore_unzipped_PE:
     input: input/fq/mySample2.1.fq, input/fq/mySample2.2.fq
     output: output/mySample2.1.fq.gz, output/mySample2.2.fq.gz
     jobid: 2
     wildcards: sample=mySample2


 localrule all:
     input: output/mySample1.1.fq.gz, output/mySample2.1.fq.gz, output/mySample1.2.fq.gz, output/   mySample2.2.fq.gz
     jobid: 0

 Job counts:
         count   jobs
         1       all
         2       trim_galore_unzipped_PE
         3

$Snakemake-Dry(输入:fgqz)

 rule trim_galore_unzipped_PE:
     input: input/fqgz/mySample1.1.fq.gz, input/fqgz/mySample1.2.fq.gz
     output: output/mySample1.1.fq.gz, output/mySample1.2.fq.gz
     jobid: 1
     wildcards: sample=mySample1


 rule trim_galore_unzipped_PE:
     input: input/fqgz/mySample2.1.fq.gz, input/fqgz/mySample2.2.fq.gz
     output: output/mySample2.1.fq.gz, output/mySample2.2.fq.gz
     jobid: 2
     wildcards: sample=mySample2


 localrule all:
     input: output/mySample1.1.fq.gz, output/mySample1.2.fq.gz, output/mySample2.1.fq.gz, output/   mySample2.2.fq.gz
     jobid: 0

 Job counts:
         count   jobs
         1       all
         2       trim_galore_unzipped_PE
         3
有一些方法可以使其更通用,但由于您声明并使用YAML配置来构建大部分文件名,因此我将避免在答案中讨论它。我只是说这是可能的,并在某种程度上受到鼓励。

"--配对的{输入}"将展开以提供这两个文件。由于for循环,%1将始终位于%2之前。

这篇关于接受略有不同的Snakemaker规则输入(.fq与.fq.gz)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆