How to select all files from one sample?

Question

I'm having trouble figuring out how to make the input directive select only the files that belong to one {samples} value in the rule below.

rule MarkDup:
    input:
        expand("Outputs/MergeBamAlignment/{samples}_{lanes}_{flowcells}.merged.bam", zip,
            samples=samples['sample'],
            lanes=samples['lane'],
            flowcells=samples['flowcell']),
    output:
        bam = "Outputs/MarkDuplicates/{samples}_markedDuplicates.bam",
        metrics = "Outputs/MarkDuplicates/{samples}_markedDuplicates.metrics",
    shell:
        "gatk --java-options -Djava.io.tmpdir=`pwd`/tmp \
        MarkDuplicates \
        $(echo ' {input}' | sed 's/ / --INPUT /g') \
        -O {output.bam} \
        --VALIDATION_STRINGENCY LENIENT \
        --METRICS_FILE {output.metrics} \
        --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 200000 \
        --CREATE_INDEX true \
        --TMP_DIR Outputs/MarkDuplicates/tmp"

Currently it creates correctly named output files, but the input selects every file that matches the pattern across all wildcard values, not just the files for the current sample. So I'm perhaps halfway there. I tried changing {samples} to {{samples}} in the input directive, like this:

expand("Outputs/MergeBamAlignment/{{samples}}_{lanes}_{flowcells}.merged.bam", zip,
            lanes=samples['lane'],
            flowcells=samples['flowcell']),

But this somehow broke the previous rule. So the solution would be something like:

input:
     "{sample}_*.bam"

But clearly this doesn't work. Is it possible to collect all files that match {sample}_*.bam with a function and use that as input? And if so, will the function still work with $(echo ' {input}' etc...) in the shell directive?

Recommended answer

If you just want all the files in the directory that match the sample, you can use a lambda function as the input:

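from glob import glob

rule MarkDup:
    input:
        # Input function: called with the rule's wildcards, returns a list of paths.
        # The "_" after the sample name keeps e.g. "S1" from also matching "S10".
        lambda wcs: sorted(glob('Outputs/MergeBamAlignment/%s_*.merged.bam' % wcs.samples))
    output:
        bam="Outputs/MarkDuplicates/{samples}_markedDuplicates.bam",
        metrics="Outputs/MarkDuplicates/{samples}_markedDuplicates.metrics"
    shell:
        ...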

Just be aware that this approach can't do any checking for missing files, since it will always report that the files needed are the files that are present. If you do need confirmation that the upstream rule has been executed, you can have the previous rule touch a flag, which you then require as input to this rule (though you don't actually use the file for anything other than enforcing execution order).
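If checking for missing files matters, one possible alternative to globbing is to build the expected paths from the sample table itself. The following is only a sketch under assumptions: the table contents are made up, a plain dict of lists stands in for whatever `samples` really is in the question (a pandas DataFrame indexed the same way should behave alike here), and `merged_bams_for` and `as_input_args` are hypothetical helper names, not part of Snakemake.

```python
# Hypothetical sample table. The question indexes `samples` by
# 'sample', 'lane' and 'flowcell'; a dict of lists stands in for it here.
samples = {
    "sample": ["A", "A", "B"],
    "lane": ["L001", "L002", "L001"],
    "flowcell": ["FC1", "FC1", "FC2"],
}

# Same file layout as the expand() pattern in the question.
PATTERN = "Outputs/MergeBamAlignment/{sample}_{lane}_{flowcell}.merged.bam"


def merged_bams_for(sample, table=samples):
    """Return the merged BAM paths that the table lists for one sample."""
    rows = zip(table["sample"], table["lane"], table["flowcell"])
    return [
        PATTERN.format(sample=s, lane=l, flowcell=f)
        for s, l, f in rows
        if s == sample
    ]


def as_input_args(paths):
    """Roughly what the sed trick in the question does: prefix each path with --INPUT."""
    return " ".join("--INPUT " + p for p in paths)
```

In the rule, this would presumably be wired in as `lambda wcs: merged_bams_for(wcs.samples)`. Because the paths come from the table rather than from the file system, Snakemake can notice a missing file and schedule the upstream rule for it. Either way, an input function returns a list, and `{input}` in the shell directive is rendered as a space-separated list, so the `$(echo ' {input}' | sed ...)` construct should keep working unchanged.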
