Finding Overlaps between interval sets / Efficient Overlap Joins

Problem description

I need to join two tables:

ref contains the time intervals (from t1 to t2), along with an id for each interval and the space in which the interval occurs.

map contains time intervals (t1 to t2), each with a result res and its corresponding space.

I want to join onto ref all intervals of map (and their res values) that fall within the intervals in ref.

library(data.table)

ref <- data.table(space=rep('nI',3),t1=c(100,300,500),t2=c(150,400,600),id=letters[1:3])

map <- data.table(space=rep('nI',241),t1=seq(0,1200,by=5),t2=seq(5,1205,by=5),res=rnorm(241))

They look like this:

> ref
   space  t1  t2 id
1:    nI 100 150  a
2:    nI 300 400  b
3:    nI 500 600  c

> map
     space   t1   t2        res
  1:    nI    0    5 -0.7082922
  2:    nI    5   10  1.8251041
  3:    nI   10   15  0.2076552
  4:    nI   15   20  0.8047347
  5:    nI   20   25  2.3388920
 ---
237:    nI 1180 1185  1.0229284
238:    nI 1185 1190 -0.3657815
239:    nI 1190 1195  0.3013489
240:    nI 1195 1200  1.2947271
241:    nI 1200 1205 -1.5050221

(UPDATE) Solution

  • ?data.table::foverlaps is the key here.
  • I need to join all the map intervals that occur "within" the intervals of ref, and I am not interested in map intervals that match nothing in ref, so I use nomatch=0L.

      setkey(ref,space,t1,t2)
      
      foverlaps(map,ref,type="within",nomatch=0L)
      

      This gives:

          space  t1  t2 id i.t1 i.t2         res
       1:    nI 100 150  a  100  105 -0.85202726
       2:    nI 100 150  a  105  110  0.79748876
       3:    nI 100 150  a  110  115  1.49894097
       4:    nI 100 150  a  115  120  0.47719957
       5:    nI 100 150  a  120  125 -0.95767896
       6:    nI 100 150  a  125  130 -0.51054673
       7:    nI 100 150  a  130  135 -0.08478700
       8:    nI 100 150  a  135  140 -0.69526566
       9:    nI 100 150  a  140  145  2.14917623
      10:    nI 100 150  a  145  150 -0.05348163
      11:    nI 300 400  b  300  305  0.28834548
      12:    nI 300 400  b  305  310  0.32449616
      13:    nI 300 400  b  310  315  1.16107248
      14:    nI 300 400  b  315  320  1.08550676
      15:    nI 300 400  b  320  325  0.84640788
      16:    nI 300 400  b  325  330 -2.15485447
      17:    nI 300 400  b  330  335  1.59115714
      18:    nI 300 400  b  335  340 -0.57588128
      19:    nI 300 400  b  340  345  0.23957563
      20:    nI 300 400  b  345  350 -0.60824259
      21:    nI 300 400  b  350  355 -0.84828189
      22:    nI 300 400  b  355  360 -0.43528701
      23:    nI 300 400  b  360  365 -0.80026281
      24:    nI 300 400  b  365  370 -0.62914234
      25:    nI 300 400  b  370  375 -0.83485164
      26:    nI 300 400  b  375  380  1.46922713
      27:    nI 300 400  b  380  385 -0.53965310
      28:    nI 300 400  b  385  390  0.98728765
      29:    nI 300 400  b  390  395 -0.66328893
      30:    nI 300 400  b  395  400 -0.08182384
      31:    nI 500 600  c  500  505  0.72566100
      32:    nI 500 600  c  505  510  2.27878366
      33:    nI 500 600  c  510  515  0.72974139
      34:    nI 500 600  c  515  520 -0.35358019
      35:    nI 500 600  c  520  525 -1.20697646
      36:    nI 500 600  c  525  530 -0.01719057
      37:    nI 500 600  c  530  535  0.06686472
      38:    nI 500 600  c  535  540 -0.40866088
      39:    nI 500 600  c  540  545 -1.02697573
      40:    nI 500 600  c  545  550  2.19822065
      41:    nI 500 600  c  550  555  0.57075648
      42:    nI 500 600  c  555  560 -0.52009726
      43:    nI 500 600  c  560  565 -1.82999177
      44:    nI 500 600  c  565  570  2.53776578
      45:    nI 500 600  c  570  575  0.85626293
      46:    nI 500 600  c  575  580 -0.34245708
      47:    nI 500 600  c  580  585  1.21679869
      48:    nI 500 600  c  585  590  1.87587020
      49:    nI 500 600  c  590  595 -0.23325264
      50:    nI 500 600  c  595  600  0.18845022
          space  t1  t2 id i.t1 i.t2         res
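
To make concrete what type="within" selects, here is a minimal sketch that rebuilds the question's two tables and compares "within" with the default type="any". It assumes a fixed integer sequence for res in place of rnorm() so that the run is reproducible; the names within_hits and any_hits are illustrative and not part of the original post.

```r
library(data.table)

# Same shape as the question's tables; res is a fixed sequence (an assumption,
# standing in for rnorm) so the result is reproducible.
ref <- data.table(space = rep("nI", 3), t1 = c(100, 300, 500),
                  t2 = c(150, 400, 600), id = letters[1:3])
map <- data.table(space = rep("nI", 241), t1 = seq(0, 1200, by = 5),
                  t2 = seq(5, 1205, by = 5), res = seq_len(241))

# foverlaps() requires the second table to be keyed; the last two key
# columns are treated as the interval start and end.
setkey(ref, space, t1, t2)

# type = "within": keep map rows whose [t1, t2] lies entirely inside a ref interval.
within_hits <- foverlaps(map, ref, type = "within", nomatch = 0L)

# type = "any" (the default): keep map rows that overlap a ref interval at all.
any_hits <- foverlaps(map, ref, type = "any", nomatch = 0L)

# Number of matched map rows per ref interval.
within_hits[, .N, by = id]
any_hits[, .N, by = id]
```

With these tables the "within" join returns 10, 20 and 20 rows for a, b and c, which are the 50 rows listed above. The "any" join returns more rows, because it also keeps map intervals that only partly overlap a ref interval.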
      

      Recommended answer

      Ha, nice timing :). Just a few days back, overlap joins (or interval joins) were implemented in data.table. The function is foverlaps() and is available from the GitHub project page. Make sure to have a look at ?foverlaps.

      setkey(ref, space, t1, t2)
      foverlaps(map, ref, type="within", nomatch=0L)
      

      I think this is what you're after. It returns rows only where there's a match, and it checks for t1,t2 overlaps between ref and map within each space identifier. If you don't want that, just remove space from the key columns. And if you also want the map rows that have no match, remove nomatch=0L: the default is nomatch=NA, which returns every row of map, with NA in the ref columns where nothing matches.
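
The effect of nomatch and of dropping space from the key can be seen on a tiny example. This is a sketch under stated assumptions: the three-row map below is hypothetical (one row inside interval a, one row outside every interval, one row in a different space), and the names inner, outer and no_space are illustrative.

```r
library(data.table)

ref <- data.table(space = c("nI", "nI"), t1 = c(100, 300),
                  t2 = c(150, 400), id = c("a", "b"))

# Hypothetical map: row 1 lies inside "a", row 2 matches nothing,
# row 3 lies inside "b" by time but belongs to another space.
map <- data.table(space = c("nI", "nI", "xx"), t1 = c(100, 200, 300),
                  t2 = c(105, 205, 305), res = c(1, 2, 3))

# Key on space plus the interval: overlaps are checked within each space.
setkey(ref, space, t1, t2)
inner <- foverlaps(map, ref, type = "within", nomatch = 0L)  # matched rows only
outer <- foverlaps(map, ref, type = "within")                # default nomatch = NA

# Key on the interval alone: space no longer takes part in the match.
setkey(ref, t1, t2)
no_space <- foverlaps(map, ref, type = "within", nomatch = 0L)

nrow(inner)     # map rows matched within the same space
nrow(outer)     # every map row, with NA in id where nothing matched
nrow(no_space)  # map rows matched on t1/t2 only
```

Here inner holds only the first map row, outer holds all three map rows with NA in id for the two unmatched ones, and no_space additionally matches the third row to interval b because space is ignored.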

      The function is new (but has been rigorously tested) and is therefore not yet feature complete. If you have any suggestions for improvement or come across any problems, please feel free to file an issue.
