找到一种有效的方法来计算两个表中间隔集之间的重叠数量? [英] Finding an efficient way to count the number of overlaps between interval sets in two tables?

查看:122
本文介绍了找到一种有效的方法来计算两个表中间隔集之间的重叠数量?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

注意:为方便起见,我使用了上一篇文章中的示例数据集。

假设有两个数据集, code> ref 和 map 。它们是:

Suppose that there are two data sets, ref and map. They are:

ref <- data.table(space=rep('nI',3),t1=c(100,300,500),t2=c(150,400,600),id=letters[1:3])

map <- data.table(space=rep('nI',241),t1=seq(0,1200,by=5),t2=seq(5,1205,by=5),res=rnorm(241))

它们看起来像:

> ref
   space t1  t2   id
1:    nI 100 150  a
2:    nI 300 400  b
3:    nI 500 600  c

> map
   space   t1   t2        res
1:    nI    0    5 -0.7082922
2:    nI    5   10  1.8251041
3:    nI   10   15  0.2076552
4:    nI   15   20  0.8047347
5:    nI   20   25  2.3388920
---                           
237:    nI 1180 1185  1.0229284
238:    nI 1185 1190 -0.3657815
239:    nI 1190 1195  0.3013489
240:    nI 1195 1200  1.2947271
241:    nI 1200 1205 -1.5050221

它已经引起我的注意,在仍在开发的data.table包中,函数 foverlaps 将填充 ref 映射中的相应行。

Now, it has come to my attention that in the data.table package that is still under development, the function foverlaps will fill in intervals in ref with the corresponding rows in map.

setkey(ref,space,t1,t2)

foverlaps(map,ref,type="within",nomatch=0L)

其中:

   space t1  t2  id i.t1 i.t2         res
1:    nI 100 150  a  100  105 -0.85202726
2:    nI 100 150  a  105  110  0.79748876
3:    nI 100 150  a  110  115  1.49894097
4:    nI 100 150  a  115  120  0.47719957
5:    nI 100 150  a  120  125 -0.95767896
6:    nI 100 150  a  125  130 -0.51054673
7:    nI 100 150  a  130  135 -0.08478700
8:    nI 100 150  a  135  140 -0.69526566
9:    nI 100 150  a  140  145  2.14917623
10:    nI 100 150  a  145  150 -0.05348163
11:    nI 300 400  b  300  305  0.28834548
12:    nI 300 400  b  305  310  0.32449616
13:    nI 300 400  b  310  315  1.16107248
14:    nI 300 400  b  315  320  1.08550676
15:    nI 300 400  b  320  325  0.84640788
16:    nI 300 400  b  325  330 -2.15485447
17:    nI 300 400  b  330  335  1.59115714
18:    nI 300 400  b  335  340 -0.57588128
19:    nI 300 400  b  340  345  0.23957563
20:    nI 300 400  b  345  350 -0.60824259
21:    nI 300 400  b  350  355 -0.84828189
22:    nI 300 400  b  355  360 -0.43528701
23:    nI 300 400  b  360  365 -0.80026281
24:    nI 300 400  b  365  370 -0.62914234
25:    nI 300 400  b  370  375 -0.83485164
26:    nI 300 400  b  375  380  1.46922713
27:    nI 300 400  b  380  385 -0.53965310
28:    nI 300 400  b  385  390  0.98728765
29:    nI 300 400  b  390  395 -0.66328893
30:    nI 300 400  b  395  400 -0.08182384
31:    nI 500 600  c  500  505  0.72566100
32:    nI 500 600  c  505  510  2.27878366
33:    nI 500 600  c  510  515  0.72974139
34:    nI 500 600  c  515  520 -0.35358019
35:    nI 500 600  c  520  525 -1.20697646
36:    nI 500 600  c  525  530 -0.01719057
37:    nI 500 600  c  530  535  0.06686472
38:    nI 500 600  c  535  540 -0.40866088
39:    nI 500 600  c  540  545 -1.02697573
40:    nI 500 600  c  545  550  2.19822065
41:    nI 500 600  c  550  555  0.57075648
42:    nI 500 600  c  555  560 -0.52009726
43:    nI 500 600  c  560  565 -1.82999177
44:    nI 500 600  c  565  570  2.53776578
45:    nI 500 600  c  570  575  0.85626293
46:    nI 500 600  c  575  580 -0.34245708
47:    nI 500 600  c  580  585  1.21679869
48:    nI 500 600  c  585  590  1.87587020
49:    nI 500 600  c  590  595 -0.23325264
50:    nI 500 600  c  595  600  0.18845022
space  t1  t2 id i.t1 i.t2         res

运行development.table 1.9.3的开发版本,以下代码将帮助您运行它:

to run the developement version of data.table 1.9.3, the following code will help you run it:

install.packages("devtools")
library(devtools)
dev_mode(on=T)

install_github("Rdatatable/data.table", build_vignettes=FALSE)

dev_mode(on=F)


b $ b

我试图做的事情:

上面基本上列出了时间间隔内包含的所有时间间隔。但是,我试图通过 ref 创建一个新列计数 map ,它们在 ref 的时间间隔内。因此,我想要的表是:

The above basically lists out all the intervals contained within the time intervals. However, I am trying to just create a new column by ref that counts the number of rows in map that are within the time intervals of ref. Hence, the table I would like is:

> ref
   space t1  t2  id  count
1:    nI 100 150  a     10
2:    nI 300 400  b     20
3:    nI 500 600  c     20

每个的计数表示映射 ref 的间隔。虽然我理解一个非常基本的解决方案是只使用一个和或计数功能,是否有一个解决方案将创建计数,而不必先创建更大的填充数据集?我说这是因为我的真实数据包含超过3亿的观察。任何建议将是非常有用的!谢谢!

The count for each indicates how many rows of map fell in between each time interval of ref. While I understand a very basic solution is to just use a sum or count function to count, is there a solution that would create the counts without having to first create the larger filled in dataset? I say this because my real data contains over 300 million observations. Any suggestions would be greatly helpful! Thank you!

推荐答案

您可以使用 which = TRUE 重叠的位置,然后通过进行简单的聚合,使用获取计数

You can use which=TRUE argument to get the position of overlaps, and then use to get the counts by doing a simple aggregation:

ans = foverlaps(map, ref, type="within", nomatch=0L, which=TRUE)[, .N, by=yid]
#    yid  N
# 1:   1 10
# 2:   2 20
# 3:   3 20

code> ref 。但我们应该提供一个更直接的方法来实现这一点。

And then get this back in ref. But we should provide a more direct way to accomplish this.

这篇关于找到一种有效的方法来计算两个表中间隔集之间的重叠数量?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆