Overlapping segments in R
Problem description
I am working with a data frame whose two columns denote the start and end of a chunk. I need to know how many of these chunks are present at every position from 0 to 23110906. Sometimes the chunks overlap, and sometimes there is a region that no chunk covers at all. It is like segments in R, but I don't need a visualisation; I just need a quick way to find the number of chunks at every position. Is there an easy way?
Here's some data
m = matrix(c(10, 20, 25, 30), 2)
The relevant IRanges concept is coverage()
> library(IRanges)
> cvg = coverage(IRanges(start=m[,1], end=m[,2]))
> cvg
integer-Rle of length 30 with 4 runs
Lengths: 9 10 6 5
Values : 0 1 2 1
This is a compact run-length encoding (Rle); query it at the ith position
> cvg[22]
integer-Rle of length 1 with 1 run
Lengths: 1
Values : 2
> runValue(cvg[22])
[1] 2
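To make the run-length encoding concrete, here is a minimal base R sketch. It is not how IRanges implements Rle; it only reuses the lengths and values printed above, and `value_at` is a hypothetical helper introduced for illustration.

```r
# Coverage of the example ranges [10, 25] and [20, 30], written out in full
full <- rep(c(0L, 1L, 2L, 1L), times = c(9, 10, 6, 5))

# Base R run-length encoding: the same lengths and values that cvg prints
enc <- rle(full)

# Hypothetical helper: look up position i without expanding the encoding.
# run_ends holds the last position of each run; findInterval() counts how
# many runs end strictly before i, so adding 1 gives the run containing i.
value_at <- function(enc, i) {
  run_ends <- cumsum(enc$lengths)
  enc$values[findInterval(i, run_ends, left.open = TRUE) + 1L]
}

value_at(enc, 22)
```

The lookup only touches one entry per run, which is why the encoding stays cheap even when the covered region spans millions of positions but contains few distinct runs.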
Do math on the Rle
> cvg > 1
logical-Rle of length 30 with 3 runs
Lengths: 19 6 5
Values : FALSE TRUE FALSE
or coerce to an integer vector
> as(cvg, "integer")
[1] 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1
This
> cumsum(tabulate(m[,1], 30)) - cumsum(tabulate(m[,2], 30))
[1] 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 0
will also be reasonably fast.
Note the subtle difference between the two results, which comes from whether the end position counts as part of the range (IRanges: yes; tabulate: no): position 25 is covered twice and position 30 once in the IRanges result, but once and zero times in the tabulate result. If these are actually genome coordinates, then GenomicRanges is the place to go, since it accounts for seqname (chromosome) and strand.
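If the tabulate approach should treat ends as inclusive, as IRanges does, one option is to close each range one position after its end. The following is a sketch under that assumption; `inclusive_coverage` is a hypothetical helper name, not part of any package.

```r
m <- matrix(c(10, 20, 25, 30), 2)  # column 1: starts, column 2: ends
n <- 30                            # highest position of interest

# Hypothetical helper: coverage at positions 1..n with inclusive ends
inclusive_coverage <- function(start, end, n) {
  # number of ranges that have started at or before each position
  opened <- cumsum(tabulate(start, n))
  # a range ending at e still covers e, so it closes at e + 1;
  # tabulate() ignores values larger than n
  closed <- cumsum(tabulate(end + 1, n))
  opened - closed
}

cov <- inclusive_coverage(m[, 1], m[, 2], n)
cov[22]
```

Because tabulate() only counts positive integers, positions here start at 1. For coordinates that start at 0, as in the question, shifting every start and end by one before the call (and reading position p at index p + 1) should work.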