Overlapping segments in R


Problem description

I am working with a data frame whose two columns denote the start and end of a chunk. I need to know how many of these chunks are present at every position from 0 to 23110906. Sometimes the chunks overlap, and sometimes there is a region that no chunk covers at all. It is like segments() in R, but I don't need a visualisation; I just need a quick way to find the number of chunks at every position. Is there an easy way?

Solution

Here's some data

m = matrix(c(10, 20, 25, 30), 2)

The IRanges package (from Bioconductor) provides coverage() for this

> library(IRanges)
> cvg = coverage(IRanges(start=m[,1], end=m[,2]))
> cvg
integer-Rle of length 30 with 4 runs
  Lengths:  9 10  6  5
  Values :  0  1  2  1

The result is a compact run-length encoding (an Rle). To query the coverage at the ith position

> cvg[22]
integer-Rle of length 1 with 1 run
  Lengths: 1
  Values : 2
> runValue(cvg[22])
[1] 2
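
To illustrate why a lookup in a run-length encoding is cheap, here is a minimal base R sketch. It uses rle() and findInterval() rather than IRanges, and value_at() is a hypothetical helper introduced only for this example; it is not how the Rle class is implemented.

```r
# The coverage vector from the example, built by hand.
x <- rep(c(0L, 1L, 2L, 1L), times = c(9, 10, 6, 5))

# Base R run-length encoding: 4 runs instead of 30 values.
enc <- rle(x)
run_ends <- cumsum(enc$lengths)   # last position of each run: 9 19 25 30

# Hypothetical helper: find the run containing position i without
# expanding the encoding back to a full-length vector.
value_at <- function(enc, run_ends, i) {
  enc$values[findInterval(i, run_ends, left.open = TRUE) + 1L]
}

value_at(enc, run_ends, 22)   # 2, matching runValue(cvg[22])
```

The lookup is a binary search over the run boundaries, so its cost depends on the number of runs, not on the 23110906 positions.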

Arithmetic and comparisons work directly on the Rle

> cvg > 1
logical-Rle of length 30 with 3 runs
  Lengths:    19     6     5
  Values : FALSE  TRUE FALSE
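
A comparison like this is often a step towards listing the regions where chunks overlap. As a rough sketch in base R (using rle() on a plain integer vector instead of the Rle class; the variable names are made up for this example), the start and end of each such region can be recovered from the run lengths:

```r
# Coverage vector from the example, as a plain integer vector.
cov <- rep(c(0L, 1L, 2L, 1L), times = c(9, 10, 6, 5))

# Run-length encode the logical condition "covered by more than one chunk".
runs <- rle(cov > 1)
run_end <- cumsum(runs$lengths)
run_start <- run_end - runs$lengths + 1

# Keep only the runs where the condition holds.
high <- data.frame(start = run_start[runs$values],
                   end   = run_end[runs$values])
high   # one region: positions 20 to 25
```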

Or coerce the Rle to an ordinary integer vector

> as(cvg, "integer")
 [1] 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1

This base R alternative

> cumsum(tabulate(m[,1], 30)) - cumsum(tabulate(m[,2], 30))
 [1] 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 0

will also be reasonably fast.

Note the subtle difference between the two results, which comes from whether the end position counts as part of the range (IRanges: yes; tabulate: no). That is why the tabulate() counts drop one position earlier: compare positions 25 and 30 in the two outputs. If these are actually genome coordinates, then GenomicRanges is the place to go, since it accounts for seqname (chromosome) and strand.
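
If the end positions should be included, as in the IRanges result, the tabulate() idea can be adjusted by counting each chunk as closed one position after its end. The following is only a sketch under stated assumptions: chunk_coverage() is a hypothetical helper name, and positions are assumed to be 1-based, because tabulate() ignores 0. Data that really start at position 0 would need to be shifted by one first.

```r
# Hypothetical helper: number of closed intervals [start, end] covering
# each position 1..n, using only base R.
chunk_coverage <- function(start, end, n) {
  stopifnot(length(start) == length(end),
            all(start >= 1), all(end <= n), all(start <= end))
  # Chunks opened at or before each position.
  opened <- cumsum(tabulate(start, nbins = n + 1))
  # Chunks that ended strictly before each position (hence end + 1).
  closed <- cumsum(tabulate(end + 1, nbins = n + 1))
  (opened - closed)[seq_len(n)]
}

m <- matrix(c(10, 20, 25, 30), 2)
cov <- chunk_coverage(m[, 1], m[, 2], 30)
rle(cov)   # lengths 9 10 6 5, values 0 1 2 1, as with coverage()
```

For the data frame in the question, the call would take the two columns as start and end and the largest position of interest as n.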
