Overlapping segments in R

Views: 188 | Published: 2017/3/26 3:39:11 | Tags: r, dataframe, segments

Problem description

I am working with a data frame whose two columns denote the start and end of a chunk. I need to know how many of these chunks are present at every position from 0 to 23110906. Sometimes the chunks overlap, and sometimes there is a region that no chunk covers at all. It is like segments in R, but I don't need a visualisation; I just need a quick way to find the number of chunks at every position. Is there an easy way?

Solution

Here's some data

m = matrix(c(10, 20, 25, 30), 2)

The relevant IRanges concept is coverage()

> library(IRanges)
> cvg = coverage(IRanges(start=m[,1], end=m[,2]))
> cvg
integer-Rle of length 30 with 4 runs
  Lengths:  9 10  6  5
  Values :  0  1  2  1

The result is a compact run-length encoding (Rle); query it at the ith location

> cvg[22]
integer-Rle of length 1 with 1 run
  Lengths: 1
  Values : 2
> runValue(cvg[22])
[1] 2
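
To make the run-length idea concrete without IRanges, here is a minimal base R sketch. It assumes the coverage vector from the example is written out by hand, and `coverage_at` is a hypothetical helper name, not part of any package. It looks up positions directly on the runs instead of expanding them.

```r
# Coverage from the example above, written out by hand (assumption:
# 9 zeros, 10 ones, 6 twos, 5 ones, as printed by coverage()).
x <- c(rep(0L, 9), rep(1L, 10), rep(2L, 6), rep(1L, 5))

# Base R run-length encoding: a list with $lengths and $values.
r <- rle(x)

# Hypothetical helper: coverage at positions `pos`, read off the runs.
coverage_at <- function(r, pos) {
  run_end <- cumsum(r$lengths)  # last position of each run
  # left.open = TRUE assigns a position equal to a run end to that run
  r$values[findInterval(pos, run_end, left.open = TRUE) + 1L]
}

coverage_at(r, c(9, 10, 22, 30))
```

The design choice here is to search the cumulative run ends rather than index a full-length vector, which keeps the lookup cheap when there are few runs relative to the number of positions.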

Do math on the Rle

> cvg > 1
logical-Rle of length 30 with 3 runs
  Lengths:    19     6     5
  Values : FALSE  TRUE FALSE

or coerce to an integer vector

> as(cvg, "integer")
 [1] 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1

This base R alternative

> cumsum(tabulate(m[,1], 30)) - cumsum(tabulate(m[,2], 30))
 [1] 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 0

will also be reasonably fast.

Note the subtle difference between the two results, which comes from whether the end of each range is counted as part of the range (IRanges: yes; tabulate: no). If these are actually genome coordinates, then GenomicRanges is the place to go, to account for seqname (chromosome) and strand.
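
If the ends should be counted in base R as well, one possible adjustment is to subtract one position past each end instead of at the end itself. The sketch below assumes positive integer coordinates; `coverage_base` is a hypothetical helper name introduced only for illustration.

```r
# Hypothetical helper: per-position coverage for positions 1..n,
# treating both start and end as inside the range.
coverage_base <- function(start, end, n) {
  # +1 at every start, -1 one position past every end;
  # n + 1 bins so that an end at position n still has a slot for its -1
  delta <- tabulate(start, n + 1L) - tabulate(end + 1L, n + 1L)
  cumsum(delta)[seq_len(n)]
}

m <- matrix(c(10, 20, 25, 30), 2)
cov <- coverage_base(m[, 1], m[, 2], 30)
cov
```

On the example data this gives the same 30 values as the coerced IRanges result, with positions 25 and 30 counted as covered.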
