如何在Snakemake表格配置中使用列表,以描述生物信息流水线的测序单位 [英] How to use list in Snakemake Tabular configuration, for describing of sequencing units for bioinformatic pipeline

查看:93
本文介绍了如何在Snakemake表格配置中使用列表,以描述生物信息流水线的测序单位的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

如何在Snakemake表格配置中使用列表.

How to use a list in Snakemake tabular config.

我使用Snakemake Tabular(与BWA mem映射)配置来描述我的排序单位(在单独的行上排序的库).在下一步的分析中,我必须合并测序单位(映射的.bed文件)并合并合并的.bam文件(每个样品一个).现在,我使用YAML配置来描述哪些单位属于哪些样本.但我希望为此使用表格格式

I use Snakemake Tabular (mapping with BWA mem) configuration to describe my sequencing units (libraries sequenced on separate lines). At the next stage of analysis I have to merge sequencing units (mapped .bed files) and take merged .bam files (one for each sample). Now I'm using YAML config for describing of what units belong to what samples. But I wish to use Tabular config for this purpose,

我不清楚如何从制表符分隔的文件的单元格中写入和调用列表(包含示例信息).

I'm not clear how to write and recall a list (containing sample information) from cell of Tab separated file.

这是我的单位表格配置:

This is how my Tablular config for units looks like:

Unit    SampleSM    LineID  PlatformPL  LibraryLB   RawFileR1   RawFileR2
sample_001.lane_L1  sample_001  lane_L1 ILLUMINA    sample_001  /user/data/sample_001.lane_L1.R1.fastq.gz   /user/data/sample_001.lane_L1.R2.fastq.gz
sample_001.lane_L2  sample_001  lane_L2 ILLUMINA    sample_001  /user/data/sample_001.lane_L2.R1.fastq.gz   /user/data/sample_001.lane_L2.R2.fastq.gz
sample_001.lane_L8  sample_001  lane_L8 ILLUMINA    sample_001  /user/data/sample_001.lane_L8.R1.fastq.gz   /user/data/sample_001.lane_L8.R2.fastq.gz
sample_002.lane_L1  sample_002  lane_L1 ILLUMINA    sample_002  /user/data/sample_002.lane_L1.R1.fastq.gz   /user/data/sample_002.lane_L1.R2.fastq.gz
sample_002.lane_L2  sample_002  lane_L2 ILLUMINA    sample_002  /user/data/sample_002.lane_L2.R1.fastq.gz   /user/data/sample_002.lane_L2.R2.fastq.gz

这是我的示例的YAML配置如下:

This is how my YAML config for Samples looks like:

samples:
 "sample_001": ["sample_001.lane_L1", "sample_001.lane_L2", "sample_001.lane_L8"]
 "sample_002": ["sample_002.lane_L1", "sample_002.lane_L2"]

我的Snakemake代码:

My Snakemake code:

import pandas as pd
import os

workdir: "/user/data/snakemake/"

configfile: "Samples.yaml"

units_table = pd.read_table("Units.tsv").set_index("Unit", drop=False)

rule all:
    input:
        expand('map_folder/{unit}.bam', unit=units_table.Unit),
        expand('merge_bam_folder/{sample}.bam', sample=config["samples"]),

rule map_paired_end:
    input:
        r1 = lambda wildcards: expand(units_table.RawFileR1[wildcards.unit]),
        r2 = lambda wildcards: expand(units_table.RawFileR2[wildcards.unit])
    output:
        bam = 'map_folder/{unit}.bam'
    params: 
        bai = 'map_folder/{unit}.bam.bai',
        ref='/user/data/human_g1k_v37.fasta.gz',
        SampleSM = lambda wildcards: units_table.SampleSM[wildcards.unit],
        LineID = lambda wildcards: units_table.LineID[wildcards.unit],
        PlatformPL = lambda wildcards: units_table.PlatformPL[wildcards.unit],
        LibraryLB = lambda wildcards: units_table.LibraryLB[wildcards.unit]
    threads:
        16  
    shell:
            r"""
                    seqtk mergepe {input.r1} {input.r2}\
                    | bwa mem -M -t {threads} -v 3 \
                    {params.ref} - \
                    -R "@RG\tID:{params.LineID}\tSM:{params.SampleSM}\tPL:{params.PlatformPL}\tLB:{params.LibraryLB}"\
                    | samtools view -u -Sb - \
                    | samtools sort - -m 4G -o {output.bam} 

                    samtools index {output.bam}
                    """

rule samtools_merge_bam:
    input:  
        lambda wildcards: expand('map_folder/{file}.bam', file=config['samples'][wildcards.sample])
    output:
        bam = 'merge_bam_folder/{sample}.bam'
    threads:
        1
    shell:  
                    r"""
                    samtools merge {output.bam} {input}

                    samtools index {output.bam}
                    """

推荐答案

下面如何处理?

我已经排除了Samples.yaml,因为鉴于您的样品单,我认为这是没有必要的.

I have excluded the Samples.yaml as I think it is not necessary given your sample sheet.

在规则samtools_merge_bam中,您将收集共享同一SampleSM的所有单位重量文件.这些unit-bam文件在map_paired_end中创建,其中lambda表达式收集每个单元的fastq文件.

In rule samtools_merge_bam you collect all unit-bam files sharing the same SampleSM. These unit-bam files are created in map_paired_end where the lambda expression collects the fastq files for each unit.

还要注意,我已经从all规则中删除了unit-bam文件,因为(我认为)这些文件只是中间文件,可以使用

Note also that I have removed the unit-bam files from the all rule as (I think) these are just intermediate files and they could be marked as temporary using the temp() flag.

import pandas as pd
import os

workdir: "/output/dir" 

units_table = pd.read_table("Units.tsv")
samples= list(units_table.SampleSM.unique())

rule all:
    input:
        expand('merge_bam_folder/{sample}.bam', sample= samples),

rule map_paired_end:
    input:
        r1 = lambda wildcards: units_table.RawFileR1[units_table.Unit == wildcards.unit],
        r2 = lambda wildcards: units_table.RawFileR2[units_table.Unit == wildcards.unit],
    output:
        bam = 'map_folder/{unit}.bam'
    params: 
        bai = 'map_folder/{unit}.bam.bai',
        ref='/user/data/human_g1k_v37.fasta.gz',
        SampleSM = lambda wildcards: list(units_table.SampleSM[units_table.Unit == wildcards.unit]),
        LineID = lambda wildcards: list(units_table.LineID[units_table.Unit == wildcards.unit]),
        PlatformPL = lambda wildcards: list(units_table.PlatformPL[units_table.Unit == wildcards.unit]),
        LibraryLB = lambda wildcards: list(units_table.LibraryLB[units_table.Unit == wildcards.unit]),
    threads:
        16  
    shell:
        r"""
        seqtk mergepe {input.r1} {input.r2}\
        | bwa mem -M -t {threads} -v 3 \
        {params.ref} - \
        -R "@RG\tID:{params.LineID}\tSM:{params.SampleSM}\tPL:{params.PlatformPL}\tLB:{params.LibraryLB}"\
        | samtools view -u -Sb - \
        | samtools sort - -m 4G -o {output.bam} 

        samtools index {output.bam}
        """

rule samtools_merge_bam:
    input:  
        lambda wildcards: expand('map_folder/{unit}.bam',
            unit= units_table.Unit[units_table.SampleSM == wildcards.sample])
    output:
        bam = 'merge_bam_folder/{sample}.bam'
    threads:
        1
    shell:  
        r"""
        samtools merge {output.bam} {input}

        samtools index {output.bam}
        """

这篇关于如何在Snakemake表格配置中使用列表,以描述生物信息流水线的测序单位的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆