Finding Overlaps between interval sets / Efficient Overlap Joins

Question

Overview:

I need to join two tables:

ref contains the time intervals (from t1 to t2) along with an id for each interval and a space where this interval occurs.

map contains time intervals (t1 to t2) each with a result res and its corresponding space.

I want to join onto ref all intervals of map (together with their res score) that fall within one of the intervals in ref.

example:

library(data.table)

ref <- data.table(space=rep('nI',3),t1=c(100,300,500),t2=c(150,400,600),id=letters[1:3])

map <- data.table(space=rep('nI',241),t1=seq(0,1200,by=5),t2=seq(5,1205,by=5),res=rnorm(241))

they look like:

> ref
   space  t1  t2 id
1:    nI 100 150  a
2:    nI 300 400  b
3:    nI 500 600  c

> map
     space   t1   t2        res
  1:    nI    0    5 -0.7082922
  2:    nI    5   10  1.8251041
  3:    nI   10   15  0.2076552
  4:    nI   15   20  0.8047347
  5:    nI   20   25  2.3388920
---                           
237:    nI 1180 1185  1.0229284
238:    nI 1185 1190 -0.3657815
239:    nI 1190 1195  0.3013489
240:    nI 1195 1200  1.2947271
241:    nI 1200 1205 -1.5050221

(UPDATE) Solution

  • ?data.table::foverlaps is the key here.

I need to join all the map intervals that fall "within" the intervals of ref. I am not interested in map intervals that have no match, so I use nomatch=0L.

setkey(ref,space,t1,t2)

foverlaps(map,ref,type="within",nomatch=0L)

which gives:

    space  t1  t2 id i.t1 i.t2         res
 1:    nI 100 150  a  100  105 -0.85202726
 2:    nI 100 150  a  105  110  0.79748876
 3:    nI 100 150  a  110  115  1.49894097
 4:    nI 100 150  a  115  120  0.47719957
 5:    nI 100 150  a  120  125 -0.95767896
 6:    nI 100 150  a  125  130 -0.51054673
 7:    nI 100 150  a  130  135 -0.08478700
 8:    nI 100 150  a  135  140 -0.69526566
 9:    nI 100 150  a  140  145  2.14917623
10:    nI 100 150  a  145  150 -0.05348163
11:    nI 300 400  b  300  305  0.28834548
12:    nI 300 400  b  305  310  0.32449616
13:    nI 300 400  b  310  315  1.16107248
14:    nI 300 400  b  315  320  1.08550676
15:    nI 300 400  b  320  325  0.84640788
16:    nI 300 400  b  325  330 -2.15485447
17:    nI 300 400  b  330  335  1.59115714
18:    nI 300 400  b  335  340 -0.57588128
19:    nI 300 400  b  340  345  0.23957563
20:    nI 300 400  b  345  350 -0.60824259
21:    nI 300 400  b  350  355 -0.84828189
22:    nI 300 400  b  355  360 -0.43528701
23:    nI 300 400  b  360  365 -0.80026281
24:    nI 300 400  b  365  370 -0.62914234
25:    nI 300 400  b  370  375 -0.83485164
26:    nI 300 400  b  375  380  1.46922713
27:    nI 300 400  b  380  385 -0.53965310
28:    nI 300 400  b  385  390  0.98728765
29:    nI 300 400  b  390  395 -0.66328893
30:    nI 300 400  b  395  400 -0.08182384
31:    nI 500 600  c  500  505  0.72566100
32:    nI 500 600  c  505  510  2.27878366
33:    nI 500 600  c  510  515  0.72974139
34:    nI 500 600  c  515  520 -0.35358019
35:    nI 500 600  c  520  525 -1.20697646
36:    nI 500 600  c  525  530 -0.01719057
37:    nI 500 600  c  530  535  0.06686472
38:    nI 500 600  c  535  540 -0.40866088
39:    nI 500 600  c  540  545 -1.02697573
40:    nI 500 600  c  545  550  2.19822065
41:    nI 500 600  c  550  555  0.57075648
42:    nI 500 600  c  555  560 -0.52009726
43:    nI 500 600  c  560  565 -1.82999177
44:    nI 500 600  c  565  570  2.53776578
45:    nI 500 600  c  570  575  0.85626293
46:    nI 500 600  c  575  580 -0.34245708
47:    nI 500 600  c  580  585  1.21679869
48:    nI 500 600  c  585  590  1.87587020
49:    nI 500 600  c  590  595 -0.23325264
50:    nI 500 600  c  595  600  0.18845022
    space  t1  t2 id i.t1 i.t2         res
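
As a sanity check on what type="within" means here, the join can be compared against a brute-force count. This is only a sketch: it rebuilds tables with the same shape as the example (res is random, so only row counts are compared), and the names joined, brute_n and fast_n are made up for illustration.

```r
library(data.table)

# Same shapes as the example tables; res is random, so only counts are checked.
ref <- data.table(space = rep("nI", 3), t1 = c(100, 300, 500),
                  t2 = c(150, 400, 600), id = letters[1:3])
map <- data.table(space = rep("nI", 241), t1 = seq(0, 1200, by = 5),
                  t2 = seq(5, 1205, by = 5), res = rnorm(241))

setkey(ref, space, t1, t2)
joined <- foverlaps(map, ref, type = "within", nomatch = 0L)

# Brute-force reading of "within": same space, and the map interval
# starts no earlier and ends no later than the ref interval.
brute_n <- vapply(seq_len(nrow(ref)), function(i) {
  sum(map$space == ref$space[i] & map$t1 >= ref$t1[i] & map$t2 <= ref$t2[i])
}, integer(1))

# Number of joined rows per ref interval, in id order.
fast_n <- joined[, .N, by = id][order(id), N]

print(data.table(id = ref$id, brute_n = brute_n, fast_n = fast_n))
```

Both columns should agree for every id, which matches the 10, 20 and 20 rows per interval in the output above.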

Answer

Ha, nice timing :). Just a few days back, overlap joins (or interval joins) were implemented in data.table. The function is foverlaps() and is available from the GitHub project page. Make sure to have a look at ?foverlaps.

setkey(ref, space, t1, t2)
foverlaps(map, ref, type="within", nomatch=0L)

I think this is what you're after. This returns rows only where there's a match, and it checks for t1,t2 overlaps between ref and map within each space identifier. If you don't want that, just remove space from the key columns. And if you also want to keep the rows of map that have no match, remove nomatch=0L: the default is nomatch=NA, which returns every row of map, with NA in the ref columns where nothing matches.
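
To make the effect of nomatch and of the space key concrete, here is a small sketch on hypothetical toy data (ref2, map2 and the result names are invented for this illustration and are not part of the question):

```r
library(data.table)

# Hypothetical toy data: two spaces, one ref interval in each.
ref2 <- data.table(space = c("A", "B"), t1 = c(0, 0), t2 = c(10, 10),
                   id = c("a", "b"))
map2 <- data.table(space = c("A", "A", "B"), t1 = c(2, 8, 3),
                   t2 = c(4, 12, 5), res = c(1, 2, 3))

setkey(ref2, space, t1, t2)

# Only map2 rows that lie within a ref2 interval of the same space.
matched <- foverlaps(map2, ref2, type = "within", nomatch = 0L)

# Every map2 row; the ref2 columns are NA where nothing matches.
all_rows <- foverlaps(map2, ref2, type = "within", nomatch = NA)

# Dropping space from the key: intervals are compared across spaces.
ref_any_space <- copy(ref2)
setkey(ref_any_space, t1, t2)
ignoring_space <- foverlaps(map2, ref_any_space, by.x = c("t1", "t2"),
                            type = "within", nomatch = 0L)

print(nrow(matched))
print(nrow(all_rows))
print(nrow(ignoring_space))
```

In this toy case the interval 8 to 12 in space A sticks out of the ref interval, so it is dropped with nomatch=0L and kept with NA columns under nomatch=NA. Without space in the key, each of the two contained map2 intervals matches both ref2 rows.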

The function is new (but has been rigorously tested) and is not yet feature complete. If you have any suggestions for improvement or come across any issues, please feel free to file an issue.
