Snakemake:在一个通配符上组合输入 [英] Snakemake: combine inputs over one wildcard

查看:22
本文介绍了Snakemake:在一个通配符上组合输入的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想知道您是否能够提供有关定义Snakemake规则以组合一个而不是所有通配符的建议?我的数据是有组织的,这样我就有了运行和样本;大多数(但不是全部)样本在每次运行中都进行了重新排序。因此,我有针对每个样本运行的预处理步骤。然后,我有一个步骤,为每个样本的每次运行组合BAM文件。然而,我遇到的问题是,我对如何定义规则感到有点困惑,这样我就可以列出与样本相对应的所有个人bam的输入(来自不同的运行)。

为了清楚起见,我将我的整个管道放在下面,但我真正的问题是规则组合_bams。如何在输入中列出单个样本的所有bam?

任何建议都是很棒的!非常提前感谢您!

# Define samples and runs
RUNS, SAMPLES = glob_wildcards("/labs/jandr/walter/tb/data/Stanford/{run}/{samp}_L001_R1_001.fastq.gz")
print("runs are: ", RUNS)
print("samples are: ", SAMPLES)

rule all:
  input:
     #trim = ['process/trim/{run}_{samp}_trim_1.fq.gz'.format(samp=sample_id, run=run_id) for sample_id, run_id in zip(sample_ids, run_ids)],
     trim = expand(['process/trim/{run}_{samp}_trim_1.fq.gz'], zip, run = RUNS, samp = SAMPLES),
     kraken=expand('process/trim/{run}_{samp}_trim_kr_1.fq.gz', zip, run = RUNS, samp = SAMPLES),
     bams=expand('process/bams/{run}_{samp}_bwa_MTB_ancestor_reference_rg_sorted.bam', zip, run = RUNS, samp = SAMPLES), # add fixed ref/mapper (expand with zip doesn't allow these to repeate)
     combined_bams=expand('process/bams/{samp}_bwa_MTB_ancestor_reference.merged.rmdup.bam', samp = np.unique(SAMPLES))

# Trim reads for quality. 
rule trim_reads:  
  input: 
    p1='/labs/jandr/walter/tb/data/Stanford/{run}/{samp}_L001_R1_001.fastq.gz', # update inputs so they only include those that exist use zip.
    p2='/labs/jandr/walter/tb/data/Stanford/{run}/{samp}_L001_R2_001.fastq.gz'
  output: 
    trim1='process/trim/{run}_{samp}_trim_1.fq.gz',
    trim2='process/trim/{run}_{samp}_trim_2.fq.gz'
  log: 
    'process/trim/{run}_{samp}_trim_reads.log'
  shell:
    '/labs/jandr/walter/tb/scripts/trim_reads.sh {input.p1} {input.p2} {output.trim1} {output.trim2} &>> {log}'

# Filter reads taxonomically with Kraken.   
rule taxonomic_filter:
  input:
    trim1='process/trim/{run}_{samp}_trim_1.fq.gz',
    trim2='process/trim/{run}_{samp}_trim_2.fq.gz'
  output: 
    kr1='process/trim/{run}_{samp}_trim_kr_1.fq.gz',
    kr2='process/trim/{run}_{samp}_trim_kr_2.fq.gz',
    kraken_stats='process/trim/{run}_{samp}_kraken.report'
  log: 
    'process/trim/{run}_{samp}_run_kraken.log'
  threads: 8
  shell:
    '/labs/jandr/walter/tb/scripts/run_kraken.sh {input.trim1} {input.trim2} {output.kr1} {output.kr2} {output.kraken_stats} &>> {log}'

# Map reads.
rule map_reads:
  input:
    ref_path='/labs/jandr/walter/tb/data/refs/{ref}.fasta.gz',
    kr1='process/trim/{run}_{samp}_trim_kr_1.fq.gz',
    kr2='process/trim/{run}_{samp}_trim_kr_2.fq.gz'
  output:
    bam='process/bams/{run}_{samp}_{mapper}_{ref}_rg_sorted.bam'
  params:
    mapper='{mapper}'
  log:
    'process/bams/{run}_{samp}_{mapper}_{ref}_map.log'
  threads: 8
  shell:
    "/labs/jandr/walter/tb/scripts/map_reads.sh {input.ref_path} {params.mapper} {input.kr1} {input.kr2} {output.bam} &>> {log}"


# Combine reads and remove duplicates (per sample).
rule combine_bams:
  input:
    bams = 'process/bams/{run}_{samp}_bwa_MTB_ancestor_reference_rg_sorted.bam'
  output: 
    combined_bam = 'process/bams/{samp}_{mapper}_{ref}.merged.rmdup.bam'
  log: 
     'process/bams/{samp}_{mapper}_{ref}_merge_bams.log'
  threads: 8
  shell:
    "sambamba markdup -r -p -t {threads} {input.bams} {output.combined_bam}"

推荐答案

创建词典以将每个样本与其运行列表相关联。

然后,对于Combine_bams规则,使用一个输入函数来使用字典为该样本生成输入文件。

rule combine_bams:
  input:
    bams = lambda wildcards: expand('process/bams/{run}_{{samp}}_bwa_MTB_ancestor_reference_rg_sorted.bam', run=sample_dict[wildcards.sample])
  output: 
    combined_bam = 'process/bams/{samp}_{mapper}_{ref}.merged.rmdup.bam'
  log: 
     'process/bams/{samp}_{mapper}_{ref}_merge_bams.log'
  threads: 8
  shell:
    "sambamba markdup -r -p -t {threads} {input.bams} {output.combined_bam}"

这篇关于Snakemake:在一个通配符上组合输入的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆