Finding overlapping ranges between two interval data
Question
I have one table with the coordinates (start, end) of ca. 500000 fragments and another table with 60000 single coordinates that I would like to match to those fragments. That is, for each record in the dtCoords table I need to find the record(s) in the dtFrags table that have the same chr and satisfy start <= coord <= end (and retrieve the type from those dtFrags records). Is it a good idea to use R for this at all, or should I rather look at other languages?
Here is my example:
require(data.table)
dtFrags <- fread(
"id,chr,start,end,type
1,1,100,200,exon
2,2,300,500,intron
3,X,400,600,intron
4,2,250,600,exon
")
dtCoords <- fread(
"id,chr,coord
10,1,150
20,2,300
30,Y,500
")
At the end, I would like to have something like this:
"idC,chr,coord,idF,type
10, 1, 150, 1, exon
20, 2, 300, 2, intron
20, 2, 300, 4, exon
30, Y, 500, NA, NA
"
I can simplify the task a bit by splitting the tables into subtables by chr, so that I only have to concentrate on the coordinates:
setkey(dtCoords, 'chr')
setkey(dtFrags, 'chr')
for (chr in unique(dtCoords$chr)) {
dtCoordsSub <- dtCoords[chr];
dtFragsSub <- dtFrags[chr];
dtCoordsSub[, {
# ????
}, by=id]
}
But it's still not clear to me what I should do inside the loop... I would be very grateful for any hints.
UPD. Just in case, I put my real tables in the archive here. After unpacking it to your working directory, the tables can be loaded with the following code:
dtCoords <- fread("dtCoords.txt", sep=" ", header=TRUE)
dtFrags <- fread("dtFrags.txt", sep=" ", header=TRUE)
Recommended answer
In general, it's very appropriate to use the Bioconductor package IRanges to deal with problems related to intervals. It does so efficiently by implementing an interval tree. GenomicRanges is another package that builds on top of IRanges, specifically for handling, well, "Genomic Ranges".
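require(GenomicRanges)
# Fragments as genomic ranges; chr becomes a factor with a shared set of levels
gr1 = with(dtFrags, GRanges(Rle(factor(chr,
          levels=c("1", "2", "X", "Y"))), IRanges(start, end)))
# Single coordinates as width-1 ranges [coord, coord]
gr2 = with(dtCoords, GRanges(Rle(factor(chr,
          levels=c("1", "2", "X", "Y"))), IRanges(coord, coord)))
# Query = coordinates, subject = fragments
olaps = findOverlaps(gr2, gr1)
# Number the coordinates, then copy that number onto every fragment they hit.
# Note: grp holds one value per fragment, so this assumes that no fragment
# is hit by more than one coordinate.
dtCoords[, grp := seq_len(nrow(dtCoords))]
dtFrags[subjectHits(olaps), grp := queryHits(olaps)]
setkey(dtCoords, grp)
setkey(dtFrags, grp)
# Join on grp; coordinates without a hit get NA for id and type
dtFrags[, list(grp, id, type)][dtCoords]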
   grp id   type id.1 chr coord
1:   1  1   exon   10   1   150
2:   2  2 intron   20   2   300
3:   2  4   exon   20   2   300
4:   3 NA     NA   30   Y   500
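For readers who would rather stay within data.table, the same match can probably be expressed with foverlaps(). This is only a sketch of an alternative technique, not the method used in the answer above. It assumes a data.table version that provides foverlaps() and the text= argument of fread(); the names q, res and out are made up for illustration, and colClasses is used so that chr is read as character in both tables.

```r
library(data.table)

# Example tables from the question; chr is forced to character in both
# so that the overlap join compares like with like.
dtFrags <- fread(text = "id,chr,start,end,type
1,1,100,200,exon
2,2,300,500,intron
3,X,400,600,intron
4,2,250,600,exon
", colClasses = list(character = "chr"))

dtCoords <- fread(text = "id,chr,coord
10,1,150
20,2,300
30,Y,500
", colClasses = list(character = "chr"))

# Treat every coordinate as the zero-width interval [coord, coord].
# q is a hypothetical helper copy, so dtCoords itself is left unchanged.
q <- copy(dtCoords)[, `:=`(start = coord, end = coord)]

# foverlaps() needs the subject table keyed, with the interval
# start/end columns as the last two key columns.
setkey(dtFrags, chr, start, end)

# Overlap join: same chr and start <= coord <= end.
# nomatch = NA keeps coordinates that fall in no fragment.
res <- foverlaps(q, dtFrags,
                 by.x = c("chr", "start", "end"),
                 type = "any", nomatch = NA)

# Columns coming from q that clash with dtFrags get an "i." prefix.
out <- res[, .(idC = i.id, chr, coord, idF = id, type)]
setorder(out, idC, idF)
print(out)
```

Because every overlapping pair becomes its own row, this variant does not rely on each fragment being hit by at most one coordinate.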