Snakemake: group files together by wildcard
Question
I have a Snakemake file containing rules to concatenate files listed in a samplesheet. The samplesheet looks like this:
sample unit fq1 fq2
A lane1 A.l1.1.R1.txt A.l1.1.R2.txt
A lane1 A.l1.2.R1.txt A.l1.2.R2.txt
A lane2 A.l2.R1.txt A.l2.R2.txt
B lane1 B.l1.R1.txt B.l1.R2.txt
B lane2 B.l2.R1.txt B.l2.R2.txt
My goal is to merge the fq1 files (and likewise the fq2 files) from the same sample and the same unit and put the result in {sample}/fastq/, and then to merge the resulting files for one sample (the ones in {sample}/fastq/) into {sample}/bam/.
This works for {sample}/fastq/, but for {sample}/bam/ every file in every {sample}/fastq/ directory, regardless of the sample, gets concatenated into {sample}/bam/. Any idea how to solve this?
import pandas as pd

shell.executable("bash")
configfile: "config.yaml"

# open samplesheet
units = pd.read_table(config["units"], dtype=str)
# set df index
units = units.set_index(["sample", "unit"])

rule all:
    input:
        expand("{sample}/bam/{sample}_bam.txt",
               sample=units.index.get_level_values('sample').unique().values),

# functions to return information in the samplesheet
def get_fastq_r1(wildcards):
    return units.loc[(wildcards.sample, wildcards.unit), ["fq1"]].dropna().values.flatten()

def get_fastq_r2(wildcards):
    return units.loc[(wildcards.sample, wildcards.unit), ["fq2"]].dropna().values.flatten()

# merge files from the same sample and unit
rule merge_fastq_lane:
    input:
        r1 = get_fastq_r1,
        r2 = get_fastq_r2
    output:
        r1_o = "{sample}/fastq/{sample}_{unit}_merge_R1.fastq",
        r2_o = "{sample}/fastq/{sample}_{unit}_merge_R2.fastq"
    message:
        "Merge fastq from the same sample and lane"
    shell:
        """
        cat {input.r1} > {output.r1_o}
        cat {input.r2} > {output.r2_o}
        """

# merge files from the same sample
rule align_lane:
    input:
        r1 = expand("{sample}/fastq/{sample}_{unit}_merge_R1.fastq",
                    unit=units.index.get_level_values('unit').unique().values,
                    sample=units.index.get_level_values('sample').unique().values),
        r2 = expand("{sample}/fastq/{sample}_{unit}_merge_R2.fastq",
                    unit=units.index.get_level_values('unit').unique().values,
                    sample=units.index.get_level_values('sample').unique().values)
    output:
        bam = "{sample}/bam/{sample}_bam.txt"
    message:
        "Align lane with bwa mem"
    shell:
        """
        cat {input.r1} {input.r2} > {output.bam}
        """
Recommended answer
In your rule align_lane, the input lists every possible sample. Since you use the sample wildcard in the output, I guess you want to use it in the input as well. The way to keep a wildcard inside an expand call is to double its braces, so that expand leaves it in place for the rule to fill in. So I guess your rule should look like this (if I understood correctly):
rule align_lane:
    input:
        r1 = expand("{{sample}}/fastq/{{sample}}_{unit}_merge_R1.fastq",
                    unit=units.index.get_level_values('unit').unique().values),
        r2 = expand("{{sample}}/fastq/{{sample}}_{unit}_merge_R2.fastq",
                    unit=units.index.get_level_values('unit').unique().values)
    output:
        bam = "{sample}/bam/{sample}_bam.txt"
    message:
        "Align lane with bwa mem"
    shell:
        """
        cat {input.r1} {input.r2} > {output.bam}
        """
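
To see what the doubled braces change, here is a small plain-Python sketch. The mini_expand helper is hypothetical: it is not Snakemake's expand, only a stand-in that fills a pattern with every combination of values using str.format. The sample and unit lists are taken from the samplesheet in the question.

```python
from itertools import product


def mini_expand(pattern, **wildcards):
    """Hypothetical stand-in for Snakemake's expand(): format the pattern
    with every combination of the supplied wildcard values."""
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


samples = ["A", "B"]
units = ["lane1", "lane2"]

# Question's version: sample is expanded too, so every sample ends up
# in the input list of every output file.
all_samples = mini_expand("{sample}/fastq/{sample}_{unit}_merge_R1.fastq",
                          sample=samples, unit=units)

# Answer's version: doubled braces are escaped, so {sample} is left in
# place as a wildcard and only unit is expanded.
per_sample = mini_expand("{{sample}}/fastq/{{sample}}_{unit}_merge_R1.fastq",
                         unit=units)

print(all_samples)
print(per_sample)
```

The first list contains files for both A and B, which is why everything was concatenated; the second still contains the literal {sample} placeholder, which the rule then fills in from its output. Note that the rule above still pairs each sample with every unit name found in the sheet. That matches the samplesheet here, where A and B both have lane1 and lane2, but a sample lacking one of those units would probably need an input function that looks up its own units instead.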