Finding overlapping ranges between two sets of interval data

Question

I have one table with the coordinates (start, end) of about 500,000 fragments and another table with 60,000 single coordinates that I would like to match against those fragments. That is, for each record in the dtCoords table I need to find the records in the dtFrags table that have the same chr and satisfy start <= coord <= end, and retrieve the type from each matching dtFrags record. Is it a good idea to use R for this at all, or should I look at other languages?

Here is my example:

require(data.table)

dtFrags <- fread(
  "id,chr,start,end,type
 1,1,100,200,exon
 2,2,300,500,intron
 3,X,400,600,intron
 4,2,250,600,exon
")

dtCoords <- fread(
"id,chr,coord
 10,1,150
 20,2,300
 30,Y,500
")

In the end, I would like to have something like this:

"idC,chr,coord,idF,type
 10,  1,  150,  1, exon
 20,  2,  300,  2, intron
 20,  2,  300,  4, exon
 30,  Y,  500, NA, NA
"

I can simplify the task a bit by splitting the tables into subtables by chr, so that I only need to concentrate on the coordinates:

setkey(dtCoords, 'chr')
setkey(dtFrags,  'chr')

for (chr in unique(dtCoords$chr)) {
  dtCoordsSub <- dtCoords[chr];
  dtFragsSub  <-  dtFrags[chr];
  dtCoordsSub[, {
    # ????  
  }, by=id]  
}

But it is still not clear to me what I should do inside the loop. I would be very grateful for any hints.

UPD. Just in case, I put my real tables in the archive here. After unpacking it into your working directory, the tables can be loaded with the following code:

dtCoords <- fread("dtCoords.txt", sep="\t", header=TRUE)
dtFrags  <- fread("dtFrags.txt",  sep="\t", header=TRUE)

Recommended answer

In general, the Bioconductor package IRanges is a very good fit for problems that involve intervals. It handles them efficiently by implementing an interval tree. GenomicRanges is another package that builds on top of IRanges, specifically for handling, well, "genomic ranges".

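library(GenomicRanges)
library(data.table)

# Use the same chromosome levels in both objects
chrLevels = union(dtFrags$chr, dtCoords$chr)
gr1 = with(dtFrags, GRanges(Rle(factor(chr, levels=chrLevels)),
                            IRanges(start, end)))
# A single coordinate is an interval of width 1
gr2 = with(dtCoords, GRanges(Rle(factor(chr, levels=chrLevels)),
                             IRanges(coord, coord)))
olaps = findOverlaps(gr2, gr1)   # query = coordinates, subject = fragments

dtCoords[, grp := seq_len(nrow(dtCoords))]
# One row per hit, so a fragment that matches several coordinates is kept for each
dtHits = dtFrags[subjectHits(olaps), list(grp = queryHits(olaps), id, type)]
setkey(dtCoords, grp)
setkey(dtHits, grp)
dtHits[dtCoords]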

   grp id   type id.1 chr coord
1:   1  1   exon   10   1   150
2:   2  2 intron   20   2   300
3:   2  4   exon   20   2   300
4:   3 NA     NA   30   Y   500
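
For readers who would rather stay within data.table, the same match can also be sketched as a non-equi join (available in data.table 1.9.8 and later). This is not part of the answer above, only a minimal sketch on the example tables from the question. The output column names idC and idF follow the desired table in the question, and reading chr as character in both tables is an assumption made here so that the join columns have the same type.

```r
library(data.table)

# Example tables from the question. chr is forced to character in both tables
# so that the join columns have the same type (an assumption of this sketch).
dtFrags <- fread("id,chr,start,end,type
1,1,100,200,exon
2,2,300,500,intron
3,X,400,600,intron
4,2,250,600,exon
", colClasses = list(character = "chr"))

dtCoords <- fread("id,chr,coord
10,1,150
20,2,300
30,Y,500
", colClasses = list(character = "chr"))

# For every coordinate, find the fragments on the same chr
# with start <= coord <= end. Unmatched coordinates are kept with NA.
res <- dtFrags[dtCoords,
               on = .(chr, start <= coord, end >= coord),
               .(idC = i.id, chr = i.chr, coord = i.coord,
                 idF = x.id, type = x.type)]
print(res)
```

On the example data this should give one row per matching coordinate/fragment pair, with NA in idF and type for the coordinate on chr Y. Passing nomatch = 0L to the join would drop unmatched coordinates instead.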
