基因组覆盖率作为滑动窗口 [英] Genome coverage as sliding window

查看:330
本文介绍了基因组覆盖率作为滑动窗口的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我使用bwa mem算法将读物映射到我的程序集,并使用samtools depth提取了每个碱基的读数(= coverage).生成的文件如下:

I mapped my reads to my assembly using the bwa mem algorithm and extracted the number of reads per base (= coverage) using samtools depth. The resulting file is the following:

1091900001  1   236
1091900001  2   245
1091900001  3   265
1091900001  4   283
1091900001  5   288
1091900002  1   297
1091900002  2   312
1091900002  3   327
1091900002  4   338
1091900002  5   348

具有三列:重叠群的名称(由于它是一个多重重叠群文件,因此此ID会更改)-位置(基数)-映射的读取次数(覆盖率).

with three columns: name of the contig (since it is a multi-contig file, this ID changes) - position (base) - number of reads that mapped (coverage).

现在,我想计算滑动窗口中的覆盖率(第三列);每个重叠区(第一列)的窗口大小为3,滑动的平均值为2.

Now I want to calculate the coverage (third column) in sliding windows; in a window size of 3 and slide of 2 as the mean - per contig (first column).

我想使用zoo软件包的rollapply功能.

I want to use the rollapply function of the zoo package.

require(zoo)
cov <- read.table("file",header=FALSE, sep="\t", na.strings="NA", dec=".", strip.white=TRUE)
library(reshape) #loads the library to rename the column names
cov<-rename(cov,c(V1="Chr", V2="locus", V3="depth")) #renames the header
rollapply(cov$depth, width = 3, by = 2, FUN = mean, align = "left")

但是,这当然没有考虑重叠群.另外,我的预期输出应包括contig-info&窗口,它是计算得出的:

But this of course does not take into account the contig. Plus, my expected output should include the contig-info & the window, it was calculated:

1091900001  1   3   248.6667
1091900001  3   5   278.6667
1091900002  1   3   312.0000
1091900002  3   5   337.6667

R中有一种简单的方法吗?

Is there an easy way for doing that in R?

推荐答案

以下是使用dplyr函数group_bydo的方法:

Here's how you can do this with the dplyr functions group_by and do:

library(dplyr)

cov %>% 
  group_by(Chr) %>% 
  do(
    data.frame(
      window.start = rollapply(.$locus, width=3, by=2, FUN=min, align="left"),
      window.end = rollapply(.$locus, width=3, by=2, FUN=max, align="left"),
      coverage = rollapply(.$depth, width=3, by=2, FUN=mean, align="left")
      )
    )

# # A tibble: 4 x 4
# # Groups:   Chr [2]
#          Chr window.start window.end coverage
#        <int>        <int>      <int>    <dbl>
# 1 1091900001            1          3 248.6667
# 2 1091900001            3          5 278.6667
# 3 1091900002            1          3 312.0000
# 4 1091900002            3          5 337.6667

do允许您以data.frame的形式从分组操作中返回任意数量的值.在这种情况下,我们返回覆盖率值的滚动平均值,以及每个窗口中locus中的minmax值.

do allows you to return an arbitrary number of values from grouped operations in the form of a data.frame. In this case, we return a the rolling mean of the coverage value, along with the min and max values from locus from each window.

如果数据集很大,则最好使用

If your dataset is large, you may be better off performing the calculation using data.table. Its syntax is a bit harder to understand if you haven't seen it before, but it can offer substantial speed improvements in grouped operations on bigger data. Here's how your operation works with data.table:

library(data.table)    

setDT(cov)
cov[, .(
      window.start = rollapply(locus, width=3, by=2, FUN=min, align="left"),
      window.end = rollapply(locus, width=3, by=2, FUN=max, align="left"),
      coverage = rollapply(depth, width=3, by=2, FUN=mean, align="left")
      ),
    .(Chr)]

根据您提供的示例行,以下是dplyrdata.table方法的基准测试结果(以毫秒为单位):

Based on the sample rows you provided, here are the benchmark results for the dplyr and data.table approaches (measured in milliseconds):

# dplyr:
      min       lq     mean   median       uq      max neval
 7.811753 8.685976 10.10268 9.243551 10.42691 144.5274  1000

# data.table:
      min       lq     mean  median       uq      max neval
 1.924472 2.105459 2.510832 2.30479 2.685706 8.848451  1000

因此,在样本数据上,data.table选项平均快4倍.

So on the sample data, the data.table option is about 4x faster on average.

这篇关于基因组覆盖率作为滑动窗口的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆