基因组覆盖率作为滑动窗口 [英] Genome coverage as sliding window
问题描述
我使用bwa mem
算法将读物映射到我的程序集,并使用samtools depth
提取了每个碱基的读数(= coverage).生成的文件如下:
I mapped my reads to my assembly using the bwa mem
algorithm and extracted the number of reads per base (= coverage) using samtools depth
. The resulting file is the following:
1091900001 1 236
1091900001 2 245
1091900001 3 265
1091900001 4 283
1091900001 5 288
1091900002 1 297
1091900002 2 312
1091900002 3 327
1091900002 4 338
1091900002 5 348
具有三列:重叠群的名称(由于它是一个多重重叠群文件,因此此ID会更改)-位置(基数)-映射的读取次数(覆盖率).
with three columns: name of the contig (since it is a multi-contig file, this ID changes) - position (base) - number of reads that mapped (coverage).
现在,我想计算滑动窗口中的覆盖率(第三列);每个重叠区(第一列)的窗口大小为3,滑动的平均值为2.
Now I want to calculate the coverage (third column) in sliding windows; in a window size of 3 and slide of 2 as the mean - per contig (first column).
我想使用zoo
软件包的rollapply
功能.
I want to use the rollapply
function of the zoo
package.
require(zoo)
cov <- read.table("file",header=FALSE, sep="\t", na.strings="NA", dec=".", strip.white=TRUE)
library(reshape) #loads the library to rename the column names
cov<-rename(cov,c(V1="Chr", V2="locus", V3="depth")) #renames the header
rollapply(cov$depth, width = 3, by = 2, FUN = mean, align = "left")
但是,这当然没有考虑重叠群.另外,我的预期输出应包括contig-info&窗口,它是计算得出的:
But this of course does not take into account the contig. Plus, my expected output should include the contig-info & the window, it was calculated:
1091900001 1 3 248.6667
1091900001 3 5 278.6667
1091900002 1 3 312.0000
1091900002 3 5 337.6667
在R
中有一种简单的方法吗?
Is there an easy way for doing that in R
?
推荐答案
以下是使用dplyr
函数group_by
和do
的方法:
Here's how you can do this with the dplyr
functions group_by
and do
:
library(dplyr)
cov %>%
group_by(Chr) %>%
do(
data.frame(
window.start = rollapply(.$locus, width=3, by=2, FUN=min, align="left"),
window.end = rollapply(.$locus, width=3, by=2, FUN=max, align="left"),
coverage = rollapply(.$depth, width=3, by=2, FUN=mean, align="left")
)
)
# # A tibble: 4 x 4
# # Groups: Chr [2]
# Chr window.start window.end coverage
# <int> <int> <int> <dbl>
# 1 1091900001 1 3 248.6667
# 2 1091900001 3 5 278.6667
# 3 1091900002 1 3 312.0000
# 4 1091900002 3 5 337.6667
do
允许您以data.frame的形式从分组操作中返回任意数量的值.在这种情况下,我们返回覆盖率值的滚动平均值,以及每个窗口中locus
中的min
和max
值.
do
allows you to return an arbitrary number of values from grouped operations in the form of a data.frame. In this case, we return a the rolling mean of the coverage value, along with the min
and max
values from locus
from each window.
If your dataset is large, you may be better off performing the calculation using data.table
. Its syntax is a bit harder to understand if you haven't seen it before, but it can offer substantial speed improvements in grouped operations on bigger data. Here's how your operation works with data.table
:
library(data.table)
setDT(cov)
cov[, .(
window.start = rollapply(locus, width=3, by=2, FUN=min, align="left"),
window.end = rollapply(locus, width=3, by=2, FUN=max, align="left"),
coverage = rollapply(depth, width=3, by=2, FUN=mean, align="left")
),
.(Chr)]
根据您提供的示例行,以下是dplyr
和data.table
方法的基准测试结果(以毫秒为单位):
Based on the sample rows you provided, here are the benchmark results for the dplyr
and data.table
approaches (measured in milliseconds):
# dplyr:
min lq mean median uq max neval
7.811753 8.685976 10.10268 9.243551 10.42691 144.5274 1000
# data.table:
min lq mean median uq max neval
1.924472 2.105459 2.510832 2.30479 2.685706 8.848451 1000
因此,在样本数据上,data.table
选项平均快4倍.
So on the sample data, the data.table
option is about 4x faster on average.
这篇关于基因组覆盖率作为滑动窗口的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!