Finding overlapping ranges between two interval data
Problem description
I have one table with coordinates (start, end) of ca. 500000 fragments and another table with 60000 single coordinates that I would like to match with the former fragments. I.e., for each record from the dtCoords table I need to find the records in the dtFrags table that have the same chr and start <= coord <= end (and retrieve the type from that record of dtFrags). Is it a good idea at all to use R for this, or should I rather look at other languages?

Here is my example:

require(data.table)

dtFrags <- fread(
"id,chr,start,end,type
1,1,100,200,exon
2,2,300,500,intron
3,X,400,600,intron
4,2,250,600,exon
")

dtCoords <- fread(
"id,chr,coord
10,1,150
20,2,300
30,Y,500
")

At the end, I would like to have something like this:

"idC,chr,coord,idF,type
10, 1, 150, 1, exon
20, 2, 300, 2, intron
20, 2, 300, 4, exon
30, Y, 500, NA, NA
"

I can simplify the task a bit by splitting the tables into sub-tables by chr, so that I only have to concentrate on the coordinates:

setkey(dtCoords, 'chr')
setkey(dtFrags, 'chr')

for (chr in unique(dtCoords$chr)) {
  dtCoordsSub <- dtCoords[chr];
  dtFragsSub <- dtFrags[chr];
  dtCoordsSub[, {
    # ????
  }, by=id]
}

but it is still not clear to me how I should proceed inside the loop. I would be very grateful for any hints.

UPD. Just in case, I put my real tables in the archive here. After unpacking it into your working directory, the tables can be loaded with the following code:

dtCoords <- fread("dtCoords.txt", sep="\t", header=TRUE)
dtFrags <- fread("dtFrags.txt", sep="\t", header=TRUE)

Solution

In general, it's very appropriate to use the Bioconductor package IRanges to deal with problems related to intervals. It does so efficiently by implementing an interval tree. GenomicRanges is another package that builds on top of IRanges, specifically for handling, well, "Genomic Ranges".

require(GenomicRanges)

# Fragments as genomic ranges; both objects use the same chromosome levels
gr1 = with(dtFrags, GRanges(Rle(factor(chr, levels=c("1", "2", "X", "Y"))), IRanges(start, end)))
# Single coordinates as ranges of width 1
gr2 = with(dtCoords, GRanges(Rle(factor(chr, levels=c("1", "2", "X", "Y"))), IRanges(coord, coord)))

# Query = coordinates, subject = fragments
olaps = findOverlaps(gr2, gr1)

# Tag each coordinate with its row number, and each overlapping fragment
# with the row number of the coordinate it contains
dtCoords[, grp := seq_len(nrow(dtCoords))]
dtFrags[subjectHits(olaps), grp := queryHits(olaps)]

# Join on grp; coordinates without a matching fragment get NA
setkey(dtCoords, grp)
setkey(dtFrags, grp)
dtFrags[, list(grp, id, type)][dtCoords]

   grp id   type id.1 chr coord
1:   1  1   exon   10   1   150
2:   2  2 intron   20   2   300
3:   2  4   exon   20   2   300
4:   3 NA     NA   30   Y   500
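
As a possible alternative that stays within data.table, the same lookup can be sketched as a non-equi join (available in data.table 1.9.8 and later). This is not the approach from the answer above, only a minimal sketch on the toy tables from the question. The output column names idC and idF are borrowed from the desired output in the question, and the as.character() coercion is an assumption: it guards against chr being read as integer in one table and as character in the other.

```r
library(data.table)

# Toy tables from the question
dtFrags <- fread(text = "id,chr,start,end,type
1,1,100,200,exon
2,2,300,500,intron
3,X,400,600,intron
4,2,250,600,exon")

dtCoords <- fread(text = "id,chr,coord
10,1,150
20,2,300
30,Y,500")

# Assumption: make sure chr has the same type in both tables before joining
dtFrags[, chr := as.character(chr)]
dtCoords[, chr := as.character(chr)]

# For every row of dtCoords, find the rows of dtFrags with the same chr
# and start <= coord <= end. Unmatched coordinates are kept with NA
# because nomatch defaults to NA.
res <- dtFrags[dtCoords,
               on = .(chr, start <= coord, end >= coord),
               .(idC = i.id, chr = i.chr, coord = i.coord,
                 idF = x.id, type = x.type)]
print(res)
```

Because the result is built directly from the join, a fragment that contains several coordinates is reported once per coordinate, whereas the grp column in the code above can hold only one group number per fragment.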