重叠加入起点和终点 [英] Overlap join with start and end positions

查看:118
本文介绍了重叠加入起点和终点的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

请考虑以下data.table.第一个定义了一组区域,每个组"x"的开始和结束位置:

library(data.table)

d1 <- data.table(x = letters[1:5], start = c(1,5,19,30, 7), end = c(3,11,22,39,25))
setkey(d1, x, start)

#    x start end
# 1: a     1   3
# 2: b     5  11
# 3: c    19  22
# 4: d    30  39
# 5: e     7  25

第二个数据集具有相同的分组变量"x",并在每个组中定位"pos":

d2 <- data.table(x = letters[c(1,1,2,2,3:5)], pos = c(2,3,3,12,20,52,10))
setkey(d2, x, pos)

#    x pos
# 1: a   2
# 2: a   3
# 3: b   3
# 4: b  12
# 5: c  20
# 6: d  52
# 7: e  10

最终,我想在每个组x中提取'd2'中'pos'属于'start'和'end'所定义范围内的行.理想的结果是

#    x pos start  end
# 1: a   2     1    3
# 2: a   3     1    3
# 3: c  20    19   22
# 4: e  10     7   25

任何组x的开始/结束位置都不会重叠,但是可能存在不在任何区域中的值的间隙.

现在,我相信我应该使用滚动连接.据我所知,我无法在联接中使用结束"列.

我尝试过

d1[d2, roll = TRUE, nomatch = 0, mult = "all"][start <= end]

得到了

#    x start end
# 1: a     2   3
# 2: a     3   3
# 3: c    20  22
# 4: e    10  25

这是我想要的正确行集;但是,"pos"已变为"start",原始的"start"已丢失.有没有一种方法可以保留所有带有滚动联接的列,以便我可以根据需要报告开始",位置",结束"?

解决方案

重叠连接是通过 commit 1375 实现的github.com/Rdatatable/data.table"rel =" noreferrer> data.table v1.9.3 ,可在新闻:

29)Overlap joins #528 终于来了!除了type="equal"maxgapminoverlap自变量外,其他所有内容均已实现.查看?foverlaps及其使用示例.这是data.table的主要功能.

让我们考虑x,一个定义为[a, b]的间隔,其中a <= b,和y,另一个定义为[c, d]的间隔,其中c <= d.间隔y完全说成重叠 x,只要d >= a c <= b 2 .对于实现的不同类型的重叠,请查看?foverlaps.

您的问题是重叠连接的一种特殊情况:在d1中,您具有startend位置的真实物理间隔.另一方面,在d2中只有位置(pos),没有间隔.为了能够进行重叠联接,我们还需要在d2中创建间隔.这是通过创建一个附加变量pos2来实现的,该变量与pos(d2[, pos2 := pos])相同.因此,我们现在在d2中有了一个间隔,尽管具有相同的 start end 坐标.然后,可以将d2中的虚拟零宽度间隔"用于foverlap中,以与d1进行重叠连接:

require(data.table) ## 1.9.3
setkey(d1)
d2[, pos2 := pos]
foverlaps(d2, d1, by.x = names(d2), type = "within", mult = "all", nomatch = 0L)
#    x start end pos pos2
# 1: a     1   3   2    2
# 2: a     1   3   3    3
# 3: c    19  22  20   20
# 4: e     7  25  10   10

by.y默认为key(y),因此我们跳过了它.默认情况下,by.x采用key(x)(如果不存在),而采用key(y).但是d2的键不存在,并且我们无法设置y的列,因为它们的名称不同.因此,我们明确设置了by.x.

重叠的类型 inner ,并且我们希望拥有 all 匹配项,除非存在匹配项.

注意:foverlaps在后台使用了data.table的二进制搜索功能(必要时还有roll),但是某些函数参数(重叠类型,maxgap,minoverlap等)受功能findOverlaps()来自Bioconductor软件包 IRanges ,这是一个出色的软件包(以及 GenomicRanges 也是如此,它为基因组学扩展了IRanges ).


那有什么好处?

对您的数据上面的代码进行基准测试会导致foverlaps()比Gabor的答案慢(时间:Gabor的data.table解决方案= 0.004对foverlaps = 0.021秒).但是,在这种粒度下真的重要吗?

真正有趣的是,看看它的扩展性如何-在速度内存方面.在Gabor的答案中,我们基于键列x加入. 然后过滤结果.

如果d1具有约40K行,而d2具有10万行(或更多)怎么办?对于与d1中的x匹配的d2中的每行 all 的那些行将被匹配并返回,仅在以后进行过滤.这是您的Q仅略微缩放的示例:

生成数据:

require(data.table)
set.seed(1L)
n = 20e3L; k = 100e3L
idx1 = sample(100, n, TRUE)
idx2 = sample(100, n, TRUE)
d1 = data.table(x = sample(letters[1:5], n, TRUE), 
                start = pmin(idx1, idx2), 
                end = pmax(idx1, idx2))

d2 = data.table(x = sample(letters[1:15], k, TRUE), 
                pos1 = sample(60:150, k, TRUE))

翻盖:

system.time({
    setkey(d1)
    d2[, pos2 := pos1]
    ans1 = foverlaps(d2, d1, by.x=1:3, type="within", nomatch=0L)
})
# user  system elapsed 
#   3.028   0.635   3.745 

这总共占用了约1GB的内存,其中ans1的内存为420MB.实际上,这里大部分时间都花在子集上.您可以通过设置参数verbose=TRUE进行检查.

Gabor的解决方案:

## new session - data.table solution
system.time({
    setkey(d1, x)
    ans2 <- d1[d2, allow.cartesian=TRUE, nomatch=0L][between(pos1, start, end)]
})
#   user  system elapsed 
# 15.714   4.424  20.324 

总共花了约3.5GB.

我刚刚指出,Gabor已经提到中间结果所需的内存.因此,尝试sqldf:

# new session - sqldf solution
system.time(ans3 <- sqldf("select * from d1 join 
            d2 using (x) where pos1 between start and end"))
#   user  system elapsed 
# 73.955   1.605  77.049 

总共占用了约1.4GB.因此,它肯定比上面显示的占用更少的内存.

[从ans1中删除pos2并在两个答案上都设置了键之后,答案就被确认是相同的.]

请注意,这种重叠连接的设计存在以下问题:d2不一定具有相同的开始和结束坐标(例如:基因组学,我来自的领域,其中d2通常约为30-1.5亿或更多行).


foverlaps()是稳定的,但仍在开发中,这意味着某些参数和名称可能会更改.

注意:自从我在上面提到GenomicRanges以来,它也完全有能力解决此问题.它在引擎盖下使用间隔树 ,而且记忆效率也很高.在我的基因组数据基准测试中,foverlaps()更快.但这是另一篇博客文章.

Consider the following data.tables. The first defines a set of regions with start and end positions for each group 'x':

library(data.table)

d1 <- data.table(x = letters[1:5], start = c(1,5,19,30, 7), end = c(3,11,22,39,25))
setkey(d1, x, start)

#    x start end
# 1: a     1   3
# 2: b     5  11
# 3: c    19  22
# 4: d    30  39
# 5: e     7  25

The second data set has the same grouping variable 'x', and positions 'pos' within each group:

d2 <- data.table(x = letters[c(1,1,2,2,3:5)], pos = c(2,3,3,12,20,52,10))
setkey(d2, x, pos)

#    x pos
# 1: a   2
# 2: a   3
# 3: b   3
# 4: b  12
# 5: c  20
# 6: d  52
# 7: e  10

Ultimately I'd like to extract the rows in 'd2' where 'pos' falls within the range defined by 'start' and 'end', within each group x. The desired result is

#    x pos start  end
# 1: a   2     1    3
# 2: a   3     1    3
# 3: c  20    19   22
# 4: e  10     7   25

The start/end positions for any group x will never overlap but there may be gaps of values not in any region.

Now, I believe I should be using a rolling join. From what i can tell, I cannot use the "end" column in the join.

I've tried

d1[d2, roll = TRUE, nomatch = 0, mult = "all"][start <= end]

and got

#    x start end
# 1: a     2   3
# 2: a     3   3
# 3: c    20  22
# 4: e    10  25

which is the right set of rows I want; However "pos" has become "start" and the original "start" has been lost. Is there a way to preserve all the columns with the roll join so i could report "start", "pos", "end" as desired?

解决方案

Overlap joins was implemented with commit 1375 in data.table v1.9.3, and is available in the current stable release, v1.9.4. The function is called foverlaps. From NEWS:

29) Overlap joins #528 is now here, finally!! Except for type="equal" and maxgap and minoverlap arguments, everything else is implemented. Check out ?foverlaps and the examples there on its usage. This is a major feature addition to data.table.

Let's consider x, an interval defined as [a, b], where a <= b, and y, another interval defined as [c, d], where c <= d. The interval y is said to overlap x at all, iff d >= a and c <= b 1. And y is entirely contained within x, iff a <= c,d <= b 2. For the different types of overlaps implemented, please have a look at ?foverlaps.

Your question is a special case of an overlap join: in d1 you have true physical intervals with start and end positions. In d2 on the other hand, there are only positions (pos), not intervals. To be able to do an overlap join, we need to create intervals also in d2. This is achieved by creating an additional variable pos2, which is identical to pos (d2[, pos2 := pos]). Thus, we now have an interval in d2, albeit with identical start and end coordinates. This 'virtual, zero-width interval' in d2 can then be used in foverlap to do an overlap join with d1:

require(data.table) ## 1.9.3
setkey(d1)
d2[, pos2 := pos]
foverlaps(d2, d1, by.x = names(d2), type = "within", mult = "all", nomatch = 0L)
#    x start end pos pos2
# 1: a     1   3   2    2
# 2: a     1   3   3    3
# 3: c    19  22  20   20
# 4: e     7  25  10   10

by.y by default is key(y), so we skipped it. by.x by default takes key(x) if it exists, and if not takes key(y). But a key doesn't exist for d2, and we can't set the columns from y, because they don't have the same names. So, we set by.x explicitly.

The type of overlap is within, and we'd like to have all matches, only if there is a match.

NB: foverlaps uses data.table's binary search feature (along with roll where necessary) under the hood, but some function arguments (types of overlaps, maxgap, minoverlap etc..) are inspired by the function findOverlaps() from the Bioconductor package IRanges, an excellent package (and so is GenomicRanges, which extends IRanges for Genomics).


So what's the advantage?

A benchmark on the code above on your data results in foverlaps() slower than Gabor's answer (Timings: Gabor's data.table solution = 0.004 vs foverlaps = 0.021 seconds). But does it really matter at this granularity?

What would be really interesting is to see how well it scales - in terms of both speed and memory. In Gabor's answer, we join based on the key column x. And then filter the results.

What if d1 has about 40K rows and d2 has a 100K rows (or more)? For each row in d2 that matches x in d1, all those rows will be matched and returned, only to be filtered later. Here's an example of your Q scaled only slightly:

Generate data:

require(data.table)
set.seed(1L)
n = 20e3L; k = 100e3L
idx1 = sample(100, n, TRUE)
idx2 = sample(100, n, TRUE)
d1 = data.table(x = sample(letters[1:5], n, TRUE), 
                start = pmin(idx1, idx2), 
                end = pmax(idx1, idx2))

d2 = data.table(x = sample(letters[1:15], k, TRUE), 
                pos1 = sample(60:150, k, TRUE))

foverlaps:

system.time({
    setkey(d1)
    d2[, pos2 := pos1]
    ans1 = foverlaps(d2, d1, by.x=1:3, type="within", nomatch=0L)
})
# user  system elapsed 
#   3.028   0.635   3.745 

This took ~ 1GB of memory in total, out of which ans1 is 420MB. Most of the time spent here is on subset really. You can check it by setting the argument verbose=TRUE.

Gabor's solutions:

## new session - data.table solution
system.time({
    setkey(d1, x)
    ans2 <- d1[d2, allow.cartesian=TRUE, nomatch=0L][between(pos1, start, end)]
})
#   user  system elapsed 
# 15.714   4.424  20.324 

And this took a total of ~3.5GB.

I just noted that Gabor already mentions the memory required for intermediate results. So, trying out sqldf:

# new session - sqldf solution
system.time(ans3 <- sqldf("select * from d1 join 
            d2 using (x) where pos1 between start and end"))
#   user  system elapsed 
# 73.955   1.605  77.049 

Took a total of ~1.4GB. So, it definitely uses less memory than the one shown above.

[The answers were verified to be identical after removing pos2 from ans1 and setting key on both answers.]

Note that this overlap join is designed with problems where d2 doesn't necessarily have identical start and end coordinates (ex: genomics, the field where I come from, where d2 is usually about 30-150 million or more rows).


foverlaps() is stable, but is still under development, meaning some arguments and names might get changed.

NB: Since I mentioned GenomicRanges above, it is also perfectly capable of solving this problem. It uses interval trees under the hood, and is quite memory efficient as well. In my benchmarks on genomics data, foverlaps() is faster. But that's for another (blog) post, some other time.

这篇关于重叠加入起点和终点的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆