Finding overlapping ranges between two interval data


Problem description


I have one table with the coordinates (start, end) of ca. 500000 fragments and another table with 60000 single coordinates that I would like to match against those fragments. That is, for each record in the dtCoords table I need to find the records in the dtFrags table that have the same chr and start<=coord<=end (and retrieve the type from those dtFrags records). Is it a good idea at all to use R for this, or should I rather look at other languages?

Here is my example:

require(data.table)

dtFrags <- fread(
  "id,chr,start,end,type
 1,1,100,200,exon
 2,2,300,500,intron
 3,X,400,600,intron
 4,2,250,600,exon
")

dtCoords <- fread(
"id,chr,coord
 10,1,150
 20,2,300
 30,Y,500
")

At the end, I would like to have something like this:

"idC,chr,coord,idF,type
 10,  1,  150,  1, exon
 20,  2,  300,  2, intron
 20,  2,  300,  4, exon
 30,  Y,  500, NA, NA
"

I can simplify the task a bit by splitting the tables into subtables by chr, so that I only have to deal with the coordinates:

setkey(dtCoords, 'chr')
setkey(dtFrags,  'chr')

for (chr in unique(dtCoords$chr)) {
  dtCoordsSub <- dtCoords[chr];
  dtFragsSub  <-  dtFrags[chr];
  dtCoordsSub[, {
    # ????  
  }, by=id]  
}

but it is still not clear to me how I should proceed inside the loop... I would be very grateful for any hints.

UPD. Just in case, I put my real tables in an archive here. After unpacking it to your working directory, the tables can be loaded with the following code:

dtCoords <- fread("dtCoords.txt", sep="\t", header=TRUE)
dtFrags  <- fread("dtFrags.txt",  sep="\t", header=TRUE)

Solution

In general, the Bioconductor package IRanges is a very appropriate tool for problems that involve intervals. It handles them efficiently by implementing an interval tree. GenomicRanges is another package that builds on top of IRanges, specifically for handling, well, "Genomic Ranges".
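Before the full solution, here is a minimal sketch of what an overlap query returns, using IRanges alone on a single chromosome. The fragment boundaries are the two chr 2 fragments from the question. The second query position (700) is a made-up value, added only to show a coordinate with no match. The variable names `frags`, `coords` and `hits` are illustrative.

```r
library(IRanges)  # Bioconductor package, assumed to be installed

# Two fragments on one chromosome (the chr 2 rows from the question)
frags <- IRanges(start = c(300, 250), end = c(500, 600))

# Single positions are represented as ranges of width 1;
# 700 is a hypothetical position that falls outside both fragments
coords <- IRanges(start = c(300, 700), width = 1)

# query = coords, subject = frags; intervals are closed, so 300 matches 300-500
hits <- findOverlaps(coords, frags)

# One row per (coordinate, fragment) pair that overlaps
pairs <- data.frame(coord_idx = queryHits(hits), frag_idx = subjectHits(hits))
print(pairs)
```

`queryHits()` and `subjectHits()` return row indices into the query and subject respectively, which is why the solution can use them to link rows of dtCoords and dtFrags.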

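require(GenomicRanges)
# Fragments as genomic ranges; both objects share the same chromosome levels
gr1 = with(dtFrags, GRanges(Rle(factor(chr, 
          levels=c("1", "2", "X", "Y"))), IRanges(start, end)))
# Each single coordinate becomes a range of width 1
gr2 = with(dtCoords, GRanges(Rle(factor(chr, 
          levels=c("1", "2", "X", "Y"))), IRanges(coord, coord)))
# query = coordinates, subject = fragments
olaps = findOverlaps(gr2, gr1)
# grp links each fragment to the row of dtCoords it overlaps
dtCoords[, grp := seq_len(nrow(dtCoords))]
dtFrags[subjectHits(olaps), grp := queryHits(olaps)]
setkey(dtCoords, grp)
setkey(dtFrags, grp)
# Join on grp; coordinates without a matching fragment get NA
dtFrags[, list(grp, id, type)][dtCoords]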

   grp id   type id.1 chr coord
1:   1  1   exon   10   1   150
2:   2  2 intron   20   2   300
3:   2  4   exon   20   2   300
4:   3 NA     NA   30   Y   500
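As an alternative sketch that stays within data.table, the same match can be written as a non-equi join. This is not part of the answer above. It assumes a data.table version that supports non-equi conditions in `on=` and the `x.`/`i.` column prefixes. The result name `res` is illustrative. One thing to keep in mind, as far as I can tell, is that the `grp` assignment above stores a single group per fragment, so a fragment hit by several coordinates would keep only one of them, whereas a join returns every matching pair.

```r
library(data.table)

# Example tables from the question
dtFrags <- fread(text = "id,chr,start,end,type
1,1,100,200,exon
2,2,300,500,intron
3,X,400,600,intron
4,2,250,600,exon
")
dtCoords <- fread(text = "id,chr,coord
10,1,150
20,2,300
30,Y,500
")

# For every row of dtCoords (i), find the rows of dtFrags (x) with the same chr
# and start <= coord <= end. Unmatched coordinates are kept with NA
# (the default nomatch = NA).
res <- dtFrags[dtCoords,
               on = .(chr, start <= coord, end >= coord),
               .(idC = i.id, chr = i.chr, coord = i.coord,
                 idF = x.id, type = x.type)]
print(res)
```

The explicit `i.` and `x.` prefixes are used because, in a non-equi join, the columns named in `on=` take their values from the coordinate table in the result, which can be confusing when `start` and `end` are printed directly.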
