Move and rename files from multiple folders using snakemake
Question
I'm trying to find the most elegant solution, using snakemake, to move and rename ~1000 fastq files that are stored in around 50 separate folders. My original attempt was storing the file location and new sample ID data in the config file using:
CONFIG
samples:
    15533_Oct_2014/15533_L7_R1_001.fastq.gz: 15533_Extr_L7_R1.fastq.gz
    15533_Oct_2014/15533_L7_R2_001.fastq.gz: 15533_Extr_L7_R2.fastq.gz
    16826_Jan_2015/16826_L8_R1_001.fastq: 16826_Extr_L8_R1.fastq
    16826_Jan_2015/16826_L8_R2_001.fastq: 16826_Extr_L8_R2.fastq
SNAKEFILE
rule all:
    input:
        expand("fastqs/{sample}", sample=[config['samples'][x] for x in config['samples']])

rule move_and_rename_fastqs:
    input:
    output: "fastqs/{sample}"
    shell:
        """echo mv {input} {output}"""
Running snakemake -np produces the shell commands without error. It correctly creates 4 instances of the rule and populates {output} with an individual filename (i.e. the new filename specified to the right of the colon in the config file).
My issue is that I'm not 100% sure how to populate the {input} section of the shell command with the file location (i.e. to get the corresponding location stored to the left of the colon in the config file). When I use various lambda wildcards functions in an attempt to access these locations, I get errors.
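One way to fill {input} from the config mapping is to invert it, so that each new file name points back at its original location, and then look up the wildcard in an input function. The following is a minimal sketch in plain Python, assuming the samples mapping shown in CONFIG and assuming the new names are unique. The names new_to_source and source_for are hypothetical, and SimpleNamespace only stands in for the wildcards object that snakemake passes to an input function.

```python
from types import SimpleNamespace

# Stand-in for config["samples"]: original location -> new file name.
samples = {
    "15533_Oct_2014/15533_L7_R1_001.fastq.gz": "15533_Extr_L7_R1.fastq.gz",
    "15533_Oct_2014/15533_L7_R2_001.fastq.gz": "15533_Extr_L7_R2.fastq.gz",
    "16826_Jan_2015/16826_L8_R1_001.fastq": "16826_Extr_L8_R1.fastq",
    "16826_Jan_2015/16826_L8_R2_001.fastq": "16826_Extr_L8_R2.fastq",
}

# Invert the mapping so the new name (the {sample} wildcard) becomes the key.
new_to_source = {new: src for src, new in samples.items()}


def source_for(wildcards):
    """Return the original location for the requested output name."""
    return new_to_source[wildcards.sample]


# Simulate the wildcards object for one output file.
w = SimpleNamespace(sample="16826_Extr_L8_R1.fastq")
print(source_for(w))
```

In a Snakefile, a function like this could be given directly as the rule's input (input: source_for), since snakemake calls input functions with the wildcards of the job.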
Incidentally, this post suggests an alternative, and perhaps more elegant, method to tackle this by storing the file locations/new names in a .tsv file. However, it does not explain how to access information in the .tsv file within the rules.
I have made an attempt at a Snakefile for this, but it is unclear to me how to reference the information stored in sampleID and fastq, either in rule move_and_rename_fastqs or in rule all. Although snakemake -np produces output here, it is obviously gobbledygook: {input} is populated with all the files assigned to fastq, and because I'm referencing two sources for the sample information (the config file in rule all, sample_file in rule move_and_rename_fastqs), the sample IDs populating the {input} and {output} sections don't match as they should.
Any guidance with regard to the most elegant solution to get round this issue would be greatly appreciated.
SNAKEFILE 2
import pandas as pd

configfile: "config.yaml"
sample_file = config["sample_file"]
sampleID = pd.read_table(sample_file)['sampleID']
fastq = pd.read_table(sample_file)['fastq']

rule all:
    input:
        expand("fastqs/{sample}", sample=[config['samples'][x] for x in config['samples']])

rule move_and_rename_fastqs:
    input: fastq = lambda wildcards: fastq
    output: "fastqs/{sample}"
    shell:
        """echo mv {input.fastq} {output}"""
sample_file
fastq	sampleID
15533_Oct_2014/15533_L7_R1_001.fastq.gz	15533_Extr_L7_R1.fastq.gz
15533_Oct_2014/15533_L7_R2_001.fastq.gz	15533_Extr_L7_R2.fastq.gz
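The mismatch described above comes from the input function in SNAKEFILE 2 returning the whole fastq column regardless of the wildcard. Below is a minimal sketch of the difference, assuming a tab-separated sample sheet like the one above. It uses the standard-library csv module rather than pandas so that it runs without extra dependencies. The names rows, all_fastqs and fastq_for are hypothetical, and SimpleNamespace stands in for the wildcards object.

```python
import csv
from io import StringIO
from types import SimpleNamespace

# Hypothetical in-memory copy of the tab-separated sample sheet.
sample_sheet = StringIO(
    "fastq\tsampleID\n"
    "15533_Oct_2014/15533_L7_R1_001.fastq.gz\t15533_Extr_L7_R1.fastq.gz\n"
    "15533_Oct_2014/15533_L7_R2_001.fastq.gz\t15533_Extr_L7_R2.fastq.gz\n"
)
rows = list(csv.DictReader(sample_sheet, delimiter="\t"))


def all_fastqs(wildcards):
    """Ignores the wildcard, so every job receives every file."""
    return [row["fastq"] for row in rows]


def fastq_for(wildcards):
    """Keep only the rows whose sampleID equals the {sample} wildcard."""
    return [row["fastq"] for row in rows if row["sampleID"] == wildcards.sample]


w = SimpleNamespace(sample="15533_Extr_L7_R1.fastq.gz")
print(all_fastqs(w))  # both files, whatever the wildcard is
print(fastq_for(w))   # ['15533_Oct_2014/15533_L7_R1_001.fastq.gz']
```

Filtering on the wildcard is what ties each output name to exactly one source file. It also means rule all should take its targets from the same sample sheet, so that the two sides cannot drift apart.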
RESPONSE TO UNFUN CAT
import pandas as pd

configfile: "config.yaml"
sample_file = config["sample_file"]
sampleID = pd.read_table(sample_file)['sampleID']
fastq = pd.read_table(sample_file)['fastq']
df = pd.read_table(sample_file)

rule all:
    input:
        expand("fastqs/{sample}", sample=[config['samples'][x] for x in config['samples']])

rule move_and_rename_fastqs:
    input: fastq = lambda w: df[df.sampleID == w.sample].File.tolist()
    output: "fastqs/{sample}"
    shell:
        """echo mv {input.fastq} {output}"""
Recommended answer
import pandas as pd

configfile: "config.yaml"
sample_file = config["sample_file"]
sampleID = pd.read_table(sample_file)['sampleID']
fastq = pd.read_table(sample_file)['fastq']
df = pd.read_table(sample_file)

rule all:
    input:
        expand("fastqs/{sample}", sample=[config['samples'][x] for x in config['samples']])

rule move_and_rename_fastqs:
    input: fastq = lambda w: df[df.sampleID == w.sample].fastq.tolist()
    output: "fastqs/{sample}"
    shell:
        """echo mv {input.fastq} {output}"""
A version that works without any config files:
import pandas as pd
from io import StringIO

# Inline, whitespace-separated sample sheet: original location, then new name
sample_file = StringIO("""fastq sampleID
15533_Oct_2014/15533_L7_R1_001.fastq.gz 15533_Extr_L7_R1.fastq.gz
15533_Oct_2014/15533_L7_R2_001.fastq.gz 15533_Extr_L7_R2.fastq.gz""")

# Raw string so that \s is not treated as a string escape
df = pd.read_table(sample_file, sep=r"\s+", header=0)
sampleID = df.sampleID
fastq = df.fastq

rule all:
    input:
        expand("fastqs/{sample}", sample=df.sampleID)

rule move_and_rename_fastqs:
    # Select the source file whose sampleID matches the requested output name
    input: fastq = lambda w: df[df.sampleID == w.sample].fastq.tolist()
    output: "fastqs/{sample}"
    shell:
        """echo mv {input.fastq} {output}"""
This gives:
snakemake -np
Building DAG of jobs...
Job counts:
    count   jobs
    1       all
    2       move_and_rename_fastqs
    3

[Mon Jun 29 15:57:30 2020]
rule move_and_rename_fastqs:
    input: 15533_Oct_2014/15533_L7_R2_001.fastq.gz
    output: fastqs/15533_Extr_L7_R2.fastq.gz
    jobid: 2
    wildcards: sample=15533_Extr_L7_R2.fastq.gz

echo mv 15533_Oct_2014/15533_L7_R2_001.fastq.gz fastqs/15533_Extr_L7_R2.fastq.gz

[Mon Jun 29 15:57:30 2020]
rule move_and_rename_fastqs:
    input: 15533_Oct_2014/15533_L7_R1_001.fastq.gz
    output: fastqs/15533_Extr_L7_R1.fastq.gz
    jobid: 1
    wildcards: sample=15533_Extr_L7_R1.fastq.gz

echo mv 15533_Oct_2014/15533_L7_R1_001.fastq.gz fastqs/15533_Extr_L7_R1.fastq.gz

[Mon Jun 29 15:57:30 2020]
localrule all:
    input: fastqs/15533_Extr_L7_R1.fastq.gz, fastqs/15533_Extr_L7_R2.fastq.gz
    jobid: 0

Job counts:
    count   jobs
    1       all
    2       move_and_rename_fastqs
    3
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
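Note that the shell command in the rule is prefixed with echo, so even outside a dry run it would only print the mv command. For illustration, the effect of the intended move can be sketched in plain Python with shutil.move. This is only a sketch under stated assumptions: the renames mapping and the move_and_rename helper are hypothetical, and the demonstration runs on empty placeholder files in a temporary directory rather than on real fastq files.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical mapping: original location -> new file name.
renames = {
    "15533_Oct_2014/15533_L7_R1_001.fastq.gz": "15533_Extr_L7_R1.fastq.gz",
    "15533_Oct_2014/15533_L7_R2_001.fastq.gz": "15533_Extr_L7_R2.fastq.gz",
}


def move_and_rename(root, renames, target_dir="fastqs"):
    """Move each source under root into root/target_dir under its new name."""
    dest_dir = Path(root) / target_dir
    dest_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for src, new_name in renames.items():
        dest = dest_dir / new_name
        shutil.move(str(Path(root) / src), str(dest))
        moved.append(dest)
    return moved


# Demonstrate on empty placeholder files in a scratch directory.
with tempfile.TemporaryDirectory() as root:
    for src in renames:
        path = Path(root) / src
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    moved = move_and_rename(root, renames)
    moved_names = sorted(p.name for p in moved)
    sources_left = [src for src in renames if (Path(root) / src).exists()]
    print(moved_names)
```

Within the workflow itself, the equivalent step would be to drop echo from the shell command once the dry-run output looks right.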