与开始和结束位置重叠连接 [英] Overlap join with start and end positions

查看:31
本文介绍了与开始和结束位置重叠连接的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

考虑以下 data.tables.第一个为每个组x"定义了一组具有开始和结束位置的区域:

library(data.table)d1 <- data.table(x = 字母 [1:5], 开始 = c(1,5,19,30, 7), 结束 = c(3,11,22,39,25))设置键(d1,x,开始)# x 开始结束# 1: 1 3# 2: b 5 11# 3: c 19 22# 4: d 30 39# 5: e 7 25

第二个数据集具有相同的分组变量'x',并且在每组中定位'pos':

d2 <- data.table(x = 字母[c(1,1,2,2,3:5)], pos = c(2,3,3,12,20,52,10))设置键(d2,x,位置)# x 位置# 1:一个 2# 2: 3# 3: b 3# 4: b 12# 5: c 20# 6: d 52# 7: e 10

最终,我想在每个组 x 中提取 'd2' 中的行,其中 'pos' 落在由 'start' 和 'end' 定义的范围内.想要的结果是

# x pos start end# 1: 2 1 3# 2: 3 1 3# 3: c 20 19 22# 4: e 10 7 25

任何组x的开始/结束位置永远不会重叠,但可能存在不在任何区域中的值差距.

现在,我相信我应该使用滚动连接.据我所知,我不能在连接中使用结束"列.

我试过了

d1[d2, roll = TRUE, nomatch = 0, mult = "all"][start <= end]

并得到

# x 开始结束# 1: 2 3# 2: 3 3# 3: c 20 22# 4: e 10 25

这是我想要的正确行集;然而pos"变成了start",原来的start"已经丢失.有没有办法使用滚动连接保留所有列,以便我可以根据需要报告开始"、位置"、结束"?

解决方案

重叠连接 是在 data.table v1.9.3,可在 当前稳定版本,v1.9.4.该函数称为foverlaps.来自新闻:

<块引用>

29) 重叠连接 #528 是现在终于到了!!除了 type="equal"maxgapminoverlap 参数外,其他所有参数都已实现.查看 ?foverlaps 及其用法示例.这是 data.table 的主要功能.

让我们考虑 x,一个定义为 [a, b] 的区间,其中 a <= b,以及 y,另一个定义为 [c 的区间, d],其中 c <= d.区间 y 被认为完全重叠 x,如果 d >= a and c <= b> 1.y 完全包含在 in x 中,如果 a <= c,d <= b 2.对于实现的不同类型的重叠,请查看 ?foverlaps.

您的问题是重叠连接的一个特例:在 d1 中,您具有 startend 位置的真实物理间隔.另一方面,在 d2 中,只有位置 (pos),没有间隔.为了能够进行重叠连接,我们还需要在 d2 中创建间隔.这是通过创建一个额外的变量 pos2 来实现的,它与 pos (d2[, pos2 := pos]) 相同.因此,我们现在在 d2 中有一个区间,尽管具有相同的 startend 坐标.d2 中的这个虚拟的零宽度间隔"然后可以在 foverlap 中用于与 d1 进行重叠连接:

require(data.table) ## 1.9.3设置键(d1)d2[, pos2 := pos]foverlaps(d2, d1, by.x = names(d2), type = "within", mult = "all", nomatch = 0L)# x 开始结束位置 pos2# 1: 1 3 2 2# 2: 1 3 3 3# 3: c 19 22 20 20# 4: e 7 25 10 10

by.y 默认是 key(y),所以我们跳过了它.by.x 默认采用 key(x) 如果存在,如果不存在则采用 key(y).但是 d2 不存在键,我们不能从 y 设置列,因为它们没有相同的名称.所以,我们显式设置了 by.x.

重叠的类型inin,我们希望所有匹配,只有当有匹配时.>

注意:foverlaps 在底层使用 data.table 的二进制搜索功能(以及 roll 必要时),但一些函数参数(重叠类型、maxgap、minoverlap等..)的灵感来自 Bioconductor 包 findOverlaps()"noreferrer">IRanges,一个优秀的包(GenomicRanges,它扩展了 Genomics 的 IRanges).

<小时>

那么优点是什么?

上述代码的基准测试结果导致 foverlaps() 比 Gabor 的答案慢(时间:Gabor 的 data.table 解决方案 = 0.004 vs foverlaps = 0.021 秒).但在这个粒度上真的很重要吗?

真正有趣的是看看它在速度内存方面的扩展性如何.在 Gabor 的回答中,我们根据键列 x 加入.然后过滤结果.

如果 d1 有大约 40K 行而 d2 有 100K 行(或更多)怎么办?对于与 d1 中的 x 匹配的 d2 中的 每一行所有这些行将匹配并返回,仅在稍后过滤.这是您的 Q 仅略微缩放的示例:

生成数据:

require(data.table)种子(1L)n = 20e3L;k = 100e3Lidx1 = 样本(100,n,真)idx2 = 样本(100,n,真)d1 = data.table(x = sample(letters[1:5], n, TRUE),开始 = pmin(idx1, idx2),结束 = pmax(idx1, idx2))d2 = data.table(x = sample(letters[1:15], k, TRUE),pos1 = 样本(60:150,k,真))

重叠:

system.time({设置键(d1)d2[, pos2 := pos1]ans1 = foverlaps(d2, d1, by.x=1:3, type="within", nomatch=0L)})# 用户系统已过期# 3.028 0.635 3.745

这总共占用了大约 1GB 的内存,其中 ans1 是 420MB.在这里花费的大部分时间实际上是在子集上.您可以通过设置参数 verbose=TRUE 来检查它.

Gabor 的解决方案:

## new session - data.table 解决方案系统时间({设置键(d1,x)ans2 <- d1[d2, allow.cartesian=TRUE, nomatch=0L][between(pos1, start, end)]})# 用户系统已过期# 15.714 4.424 20.324

这总共占用了约 3.5GB.

我刚刚注意到 Gabor 已经提到了中间结果所需的内存.所以,试试 sqldf:

# new session - sqldf 解决方案system.time(ans3 <- sqldf("select * from d1 joind2 使用 (x) 其中 pos1 在开始和结束之间"))# 用户系统已过期# 73.955 1.605 77.049

总共占用了约 1.4GB.因此,它使用的内存肯定比上面显示的要少.

[从 ans1 中删除 pos2 并在两个答案上设置键后,答案被验证为相同.]

请注意,此重叠连接的设计存在以下问题:d2 不一定具有相同的开始和结束坐标(例如:基因组学,我来自的领域,其中 d2 通常约为 30-1.5 亿行或更多行).

<小时>

foverlaps() 稳定,但仍在开发中,这意味着一些参数和名称可能会更改.

注意:既然我在上面提到了GenomicRanges,它也完全有能力解决这个问题.它在底层使用 间隔树,而且内存效率也很高.在我对基因组数据的基准测试中,foverlaps() 更快.但那是另一篇(博客)文章,改天再说.

Consider the following data.tables. The first defines a set of regions with start and end positions for each group 'x':

library(data.table)

d1 <- data.table(x = letters[1:5], start = c(1,5,19,30, 7), end = c(3,11,22,39,25))
setkey(d1, x, start)

#    x start end
# 1: a     1   3
# 2: b     5  11
# 3: c    19  22
# 4: d    30  39
# 5: e     7  25

The second data set has the same grouping variable 'x', and positions 'pos' within each group:

d2 <- data.table(x = letters[c(1,1,2,2,3:5)], pos = c(2,3,3,12,20,52,10))
setkey(d2, x, pos)

#    x pos
# 1: a   2
# 2: a   3
# 3: b   3
# 4: b  12
# 5: c  20
# 6: d  52
# 7: e  10

Ultimately I'd like to extract the rows in 'd2' where 'pos' falls within the range defined by 'start' and 'end', within each group x. The desired result is

#    x pos start  end
# 1: a   2     1    3
# 2: a   3     1    3
# 3: c  20    19   22
# 4: e  10     7   25

The start/end positions for any group x will never overlap but there may be gaps of values not in any region.

Now, I believe I should be using a rolling join. From what i can tell, I cannot use the "end" column in the join.

I've tried

d1[d2, roll = TRUE, nomatch = 0, mult = "all"][start <= end]

and got

#    x start end
# 1: a     2   3
# 2: a     3   3
# 3: c    20  22
# 4: e    10  25

which is the right set of rows I want; However "pos" has become "start" and the original "start" has been lost. Is there a way to preserve all the columns with the roll join so i could report "start", "pos", "end" as desired?

解决方案

Overlap joins was implemented with commit 1375 in data.table v1.9.3, and is available in the current stable release, v1.9.4. The function is called foverlaps. From NEWS:

29) Overlap joins #528 is now here, finally!! Except for type="equal" and maxgap and minoverlap arguments, everything else is implemented. Check out ?foverlaps and the examples there on its usage. This is a major feature addition to data.table.

Let's consider x, an interval defined as [a, b], where a <= b, and y, another interval defined as [c, d], where c <= d. The interval y is said to overlap x at all, iff d >= a and c <= b 1. And y is entirely contained within x, iff a <= c,d <= b 2. For the different types of overlaps implemented, please have a look at ?foverlaps.

Your question is a special case of an overlap join: in d1 you have true physical intervals with start and end positions. In d2 on the other hand, there are only positions (pos), not intervals. To be able to do an overlap join, we need to create intervals also in d2. This is achieved by creating an additional variable pos2, which is identical to pos (d2[, pos2 := pos]). Thus, we now have an interval in d2, albeit with identical start and end coordinates. This 'virtual, zero-width interval' in d2 can then be used in foverlap to do an overlap join with d1:

require(data.table) ## 1.9.3
setkey(d1)
d2[, pos2 := pos]
foverlaps(d2, d1, by.x = names(d2), type = "within", mult = "all", nomatch = 0L)
#    x start end pos pos2
# 1: a     1   3   2    2
# 2: a     1   3   3    3
# 3: c    19  22  20   20
# 4: e     7  25  10   10

by.y by default is key(y), so we skipped it. by.x by default takes key(x) if it exists, and if not takes key(y). But a key doesn't exist for d2, and we can't set the columns from y, because they don't have the same names. So, we set by.x explicitly.

The type of overlap is within, and we'd like to have all matches, only if there is a match.

NB: foverlaps uses data.table's binary search feature (along with roll where necessary) under the hood, but some function arguments (types of overlaps, maxgap, minoverlap etc..) are inspired by the function findOverlaps() from the Bioconductor package IRanges, an excellent package (and so is GenomicRanges, which extends IRanges for Genomics).


So what's the advantage?

A benchmark on the code above on your data results in foverlaps() slower than Gabor's answer (Timings: Gabor's data.table solution = 0.004 vs foverlaps = 0.021 seconds). But does it really matter at this granularity?

What would be really interesting is to see how well it scales - in terms of both speed and memory. In Gabor's answer, we join based on the key column x. And then filter the results.

What if d1 has about 40K rows and d2 has a 100K rows (or more)? For each row in d2 that matches x in d1, all those rows will be matched and returned, only to be filtered later. Here's an example of your Q scaled only slightly:

Generate data:

require(data.table)
set.seed(1L)
n = 20e3L; k = 100e3L
idx1 = sample(100, n, TRUE)
idx2 = sample(100, n, TRUE)
d1 = data.table(x = sample(letters[1:5], n, TRUE), 
                start = pmin(idx1, idx2), 
                end = pmax(idx1, idx2))

d2 = data.table(x = sample(letters[1:15], k, TRUE), 
                pos1 = sample(60:150, k, TRUE))

foverlaps:

system.time({
    setkey(d1)
    d2[, pos2 := pos1]
    ans1 = foverlaps(d2, d1, by.x=1:3, type="within", nomatch=0L)
})
# user  system elapsed 
#   3.028   0.635   3.745 

This took ~ 1GB of memory in total, out of which ans1 is 420MB. Most of the time spent here is on subset really. You can check it by setting the argument verbose=TRUE.

Gabor's solutions:

## new session - data.table solution
system.time({
    setkey(d1, x)
    ans2 <- d1[d2, allow.cartesian=TRUE, nomatch=0L][between(pos1, start, end)]
})
#   user  system elapsed 
# 15.714   4.424  20.324 

And this took a total of ~3.5GB.

I just noted that Gabor already mentions the memory required for intermediate results. So, trying out sqldf:

# new session - sqldf solution
system.time(ans3 <- sqldf("select * from d1 join 
            d2 using (x) where pos1 between start and end"))
#   user  system elapsed 
# 73.955   1.605  77.049 

Took a total of ~1.4GB. So, it definitely uses less memory than the one shown above.

[The answers were verified to be identical after removing pos2 from ans1 and setting key on both answers.]

Note that this overlap join is designed with problems where d2 doesn't necessarily have identical start and end coordinates (ex: genomics, the field where I come from, where d2 is usually about 30-150 million or more rows).


foverlaps() is stable, but is still under development, meaning some arguments and names might get changed.

NB: Since I mentioned GenomicRanges above, it is also perfectly capable of solving this problem. It uses interval trees under the hood, and is quite memory efficient as well. In my benchmarks on genomics data, foverlaps() is faster. But that's for another (blog) post, some other time.

这篇关于与开始和结束位置重叠连接的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆