滚动连接与开始/结束窗口 [英] roll join with start/end window

查看:139
本文介绍了滚动连接与开始/结束窗口的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

请考虑以下 data.table s。第一个定义了每个组的开始和结束位置的一组区域

  library(data.table)

d1 < - data.table(x = letters [1:5],start = c(1,5,19,30,7),end = c(3,11,22,39,25))
setkey(d1,x,start)

#x start end
#1:a 1 3
#2:b 5 11
#3 :c 19 22
#4:d 30 39
#5:e 7 25

第二个表示每个组的观察

  d2 < -  data.table(x = letters [c 1,2,2,3:5)],pos = c(2,3,3,12,20,52,10))
setkey(d2,x,pos)

#x pos
#1:a 2
#2:a 3
#3:b 3
#4:b 12
#5:c 20
#6:d 52
#7:e 10

喜欢能够提取d2中的行,这些行位于d1中匹配的 x 值的区域中。所需的结果是

 #x pos start end 
#1:a 2 1 3
#2 :a 3 1 3
#3:c 20 19 22
#4:e 10 7 25


b $ b

任何组 x 的开始/结束位置不会重叠,但可能存在不在任何区域中的值。



现在,我相信我应该使用滚动连接。



我试过了

  d1 [d2,roll = T,nomatch = 0,mult =all] [start< = end] 

并获得

 #x start end 
#1 :a 2 3
#2:a 3 3
#3:c 20 22
#4:e 10 25

这是我想要的正确的一组行;但是pos已经成为开始,原来的开始已经丢失。有没有办法保留所有列与滚动连接,所以我可以报告开始,pos,结束根据需要?

解决方案

重叠连接已使用 commit 1375 data.table v1.9.3 ,并且可以在当前稳定版本,v1.9.4 。该函数称为 foverlaps 。从新闻


29)重叠连接 #528 现在在这里,终于!除了 type =equal maxgap minoverlap 参数,一切都实现。查看?foverlaps 及其使用示例。这是 data.table 之外的一个主要特性。


,定义为 [a,b] 的间隔,其中 a <= b [c,d] ,其中 c <= d 。间隔y被说成是重叠 x,iff d> = a c <= b 1 。 y完全包含在 x内, a <= c,d <= b 2 。对于实施的不同类型的重叠,请查看?foverlaps



您的问题是一种特殊情况重叠连接:在 d1 中,您具有 start end 仓位。另一方面,在 d2 中,只有位置( pos ),而不是间隔。为了能够进行重叠连接,我们还需要在 d2 中创建间隔。这是通过创建一个额外的变量 pos2 来实现的,它与 pos d2 [ ,pos2:= pos] )。因此,现在我们在 d2 中有一个区间,尽管具有相同的开始结束坐标。然后可以在 foverlap 中使用 d2 中的'虚拟零宽度区间' code> d1

  require(data.table)## 1.9.3 
setkey(d1)
d2 [,pos2:= pos]
foverlaps(d2,d1,by.x = names(d2),type =within,mult =all nomatch = 0L)
#x start end pos pos2
#1:a 1 3 2 2
#2:a 1 3 3 3
#3:c 19 22 20 20
#4:e 7 25 10 10

by.y 默认为 key(y),因此我们跳过它。 by.x 默认情况下 key(x)如果存在,键(y)。但是 d2 不存在键,我们不能设置 y 的列,因为他们没有相同的名称。因此,我们显式地设置 by.x 。



重叠类型 code> foverlaps 使用data.table的二进制搜索功能(必要时还包括 roll ),但是一些函数参数的重叠,maxgap,minoverlap等)的灵感来自Bioconductor包中的函数 findOverlaps() IRanges ,一个优秀的包(因此 GenomicRanges ,扩展 IRanges 用于Genomics)。






那么优点是什么?



代码上面你的数据结果 foverlaps()慢于Gabor的答案(时间:Gabor的data.table解决方案= 0.004 vs foverlaps = 0.021秒)。但是在这个粒度上真的很重要吗?



真正有趣的是看它是如何缩放 - 在速度 内存。在Gabor的答案中,我们基于关键列 x 加入。 然后过滤结果。



如果 d1 有大约40K行和 d2 有100K行(或更多)?对于 d2 中匹配 x d1 / code>,全部这些行将被匹配和返回,只有稍后被过滤。以下是您的Q比例示例:



生成数据:



  require(data.table)
set.seed(1L)
n = 20e3L; k = 100e3L
idx1 = sample(100,n,TRUE)
idx2 = sample(100,n,TRUE)
d1 = data.table ],
$ b d2 = data.table(x = sample)(bx,idx2) (字母[1:15],k,TRUE),
pos1 = sample(60:150,k,TRUE))



foverlaps:



  system.time({
setkey(d1)
d2 [,pos2:= pos1]
ans1 = foverlaps(d2,d1,by.x = 1:3,type =within,nomatch = 0L)
})
#user system elapsed
#3.028 0.635 3.745

总共需要大约1GB的内存,其中 ans1 为420MB。在这里花费的大部分时间是在子集上。您可以通过设置参数 verbose = TRUE 来检查它。



Gabor的解决方案:



  ##新会话 -  data.table解决方案
system.time({
setkey(d1,x)
ans2 < -d1 [d2,allow.cartesian = TRUE,nomatch = 0L] [between(pos1,start,end)]
})
#用户系统已过
#15.714 4.424 20.324

这总共需要〜3.5GB。



我刚刚注意到Gabor已经提到了中间结果所需的内存。所以,尝试 sqldf

 #new session  -  sqldf solution 
system.time(ans3 < - sqldf(select * from d1 join
d2 using(x)where pos1 between start and end))
#用户系统已过
# 73.955 1.605 77.049

总共〜1.4GB。



[删除 pos2后,验证了相同的答案 ans1 并在两个答案上设置键。]

请注意,此重叠连接设计有问题, d2 不一定具有相同的开始和结束坐标(例如:基因组学,我来自的字段,其中 d2 通常约为30-150万行或更多行)。






foverlaps()是稳定的,但仍在开发中,这意味着一些参数和名称可能会改变。



注意:由于我提到了 GenomicRanges 能够解决这个问题。它使用间隔树 ,并且也是相当有记忆效率的。在我的基因组数据基准中, foverlaps()更快。但这是另一个(博客)的帖子,其他时间。


Consider the following data.tables. The first defines a set of regions with start and end positions for each group

library(data.table)

d1 <- data.table(x=letters[1:5], start=c(1,5,19,30, 7), end=c(3,11,22,39,25))
setkey(d1, x,start)

#    x start end
# 1: a     1   3
# 2: b     5  11
# 3: c    19  22
# 4: d    30  39
# 5: e     7  25

The second represents observations for each group

d2 <- data.table(x=letters[c(1,1,2,2,3:5)], pos=c(2,3,3,12,20,52,10))
setkey(d2, x,pos)

#    x pos
# 1: a   2
# 2: a   3
# 3: b   3
# 4: b  12
# 5: c  20
# 6: d  52
# 7: e  10

Ultimately i'd like to be able to extract the rows in d2 that are in a region for the matching x value in d1. The desired result is

#    x pos start  end
# 1: a   2     1    3
# 2: a   3     1    3
# 3: c  20    19   22
# 4: e  10     7   25

The start/end positions for any group x will never overlap but there may be gaps of values not in any region.

Now, I believe I should be using a rolling join. From what i can tell, I cannot use the "end" column in the join.

I've tried

d1[d2, roll=T, nomatch=0, mult="all"][start<=end]

and got

#    x start end
# 1: a     2   3
# 2: a     3   3
# 3: c    20  22
# 4: e    10  25

which is the right set of rows I want; However "pos" has become "start" and the original "start" has been lost. Is there a way to preserve all the columns with the roll join so i could report "start","pos","end" as desired?

解决方案

Overlap joins was implemented with commit 1375 in data.table v1.9.3, and is available in the current stable release, v1.9.4. The function is called foverlaps. From NEWS:

29) Overlap joins #528 is now here, finally!! Except for type="equal" and maxgap and minoverlap arguments, everything else is implemented. Check out ?foverlaps and the examples there on its usage. This is a major feature addition to data.table.

Let's consider x, an interval defined as [a, b], where a <= b, and y, another interval defined as [c, d], where c <= d. The interval y is said to overlap x at all, iff d >= a and c <= b 1. And y is entirely contained within x, iff a <= c,d <= b 2. For the different types of overlaps implemented, please have a look at ?foverlaps.

Your question is a special case of an overlap join: in d1 you have true physical intervals with start and end positions. In d2 on the other hand, there are only positions (pos), not intervals. To be able to do an overlap join, we need to create intervals also in d2. This is achieved by creating an additional variable pos2, which is identical to pos (d2[, pos2 := pos]). Thus, we now have an interval in d2, albeit with identical start and end coordinates. This 'virtual, zero-width interval' in d2 can then be used in foverlap to do an overlap join with d1:

require(data.table) ## 1.9.3
setkey(d1)
d2[, pos2 := pos]
foverlaps(d2, d1, by.x = names(d2), type = "within", mult = "all", nomatch = 0L)
#    x start end pos pos2
# 1: a     1   3   2    2
# 2: a     1   3   3    3
# 3: c    19  22  20   20
# 4: e     7  25  10   10

by.y by default is key(y), so we skipped it. by.x by default takes key(x) if it exists, and if not takes key(y). But a key doesn't exist for d2, and we can't set the columns from y, because they don't have the same names. So, we set by.x explicitly.

The type of overlap is within, and we'd like to have all matches, only if there is a match.

NB: foverlaps uses data.table's binary search feature (along with roll where necessary) under the hood, but some function arguments (types of overlaps, maxgap, minoverlap etc..) are inspired by the function findOverlaps() from the Bioconductor package IRanges, an excellent package (and so is GenomicRanges, which extends IRanges for Genomics).


So what's the advantage?

A benchmark on the code above on your data results in foverlaps() slower than Gabor's answer (Timings: Gabor's data.table solution = 0.004 vs foverlaps = 0.021 seconds). But does it really matter at this granularity?

What would be really interesting is to see how well it scales - in terms of both speed and memory. In Gabor's answer, we join based on the key column x. And then filter the results.

What if d1 has about 40K rows and d2 has a 100K rows (or more)? For each row in d2 that matches x in d1, all those rows will be matched and returned, only to be filtered later. Here's an example of your Q scaled only slightly:

Generate data:

require(data.table)
set.seed(1L)
n = 20e3L; k = 100e3L
idx1 = sample(100, n, TRUE)
idx2 = sample(100, n, TRUE)
d1 = data.table(x = sample(letters[1:5], n, TRUE), 
                start = pmin(idx1, idx2), 
                end = pmax(idx1, idx2))

d2 = data.table(x = sample(letters[1:15], k, TRUE), 
                pos1 = sample(60:150, k, TRUE))

foverlaps:

system.time({
    setkey(d1)
    d2[, pos2 := pos1]
    ans1 = foverlaps(d2, d1, by.x=1:3, type="within", nomatch=0L)
})
# user  system elapsed 
#   3.028   0.635   3.745 

This took ~ 1GB of memory in total, out of which ans1 is 420MB. Most of the time spent here is on subset really. You can check it by setting the argument verbose=TRUE.

Gabor's solutions:

## new session - data.table solution
system.time({
    setkey(d1, x)
    ans2 <- d1[d2, allow.cartesian=TRUE, nomatch=0L][between(pos1, start, end)]
})
#   user  system elapsed 
# 15.714   4.424  20.324 

And this took a total of ~3.5GB.

I just noted that Gabor already mentions the memory required for intermediate results. So, trying out sqldf:

# new session - sqldf solution
system.time(ans3 <- sqldf("select * from d1 join 
            d2 using (x) where pos1 between start and end"))
#   user  system elapsed 
# 73.955   1.605  77.049 

Took a total of ~1.4GB. So, it definitely uses less memory than the one shown above.

[The answers were verified to be identical after removing pos2 from ans1 and setting key on both answers.]

Note that this overlap join is designed with problems where d2 doesn't necessarily have identical start and end coordinates (ex: genomics, the field where I come from, where d2 is usually about 30-150 million or more rows).


foverlaps() is stable, but is still under development, meaning some arguments and names might get changed.

NB: Since I mentioned GenomicRanges above, it is also perfectly capable of solving this problem. It uses interval trees under the hood, and is quite memory efficient as well. In my benchmarks on genomics data, foverlaps() is faster. But that's for another (blog) post, some other time.

这篇关于滚动连接与开始/结束窗口的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆