How to perform join over date ranges using data.table?
Problem description
How can I do the following with data.table (it is straightforward with sqldf) and get exactly the same result?
library(data.table)

whatWasMeasured <- data.table(start=as.POSIXct(seq(1, 1000, 100),
                                  origin="1970-01-01 00:00:00"),
                              end=as.POSIXct(seq(10, 1000, 100),
                                  origin="1970-01-01 00:00:00"),
                              x=1:10,
                              y=letters[1:10])

measurments <- data.table(time=as.POSIXct(seq(1, 2000, 1),
                              origin="1970-01-01 00:00:00"),
                          temp=runif(2000, 10, 100))

## Alternative short names for data.tables
dt1 <- whatWasMeasured
dt2 <- measurments

## Straightforward with sqldf
library(sqldf)
sqldf("select * from measurments m, whatWasMeasured wwm
       where m.time between wwm.start and wwm.end")
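You can use the foverlaps() function, which implements joins over intervals efficiently. It expects an interval (a start and an end column) on both sides, so in your case we just need a dummy column in measurments that turns each time point into the zero-length interval [time, time].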
Note 1: You should install the development version of data.table, v1.9.5, as a bug with foverlaps() has been fixed there. You can find the installation instructions here.
Note 2: I'll call whatWasMeasured = dt1 and measurments = dt2 here for convenience.
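require(data.table) ## 1.9.5+
## Treat each time point as the zero-length interval [time, dummy]
dt2[, dummy := time]
## foverlaps() requires the second table to be keyed on its interval columns
setkey(dt1, start, end)
## nomatch=0L keeps only matching rows (inner join); then drop the helper column
ans = foverlaps(dt2, dt1, by.x=c("time", "dummy"), nomatch=0L)[, dummy := NULL]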
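As a quick sanity check on the foverlaps() approach, the sketch below rebuilds the example tables and confirms that the join keeps only measurements that fall inside one of the intervals. The tz="UTC" argument and the checks at the end are additions for illustration and are not part of the original answer; a released data.table (1.9.6 or later) is assumed.

```r
library(data.table)

# Same shape as the tables in the question; tz = "UTC" is an assumption
# added here so both tables carry an identical time zone attribute.
dt1 <- data.table(
  start = as.POSIXct(seq(1, 1000, 100), origin = "1970-01-01", tz = "UTC"),
  end   = as.POSIXct(seq(10, 1000, 100), origin = "1970-01-01", tz = "UTC"),
  x = 1:10,
  y = letters[1:10])
dt2 <- data.table(
  time = as.POSIXct(seq(1, 2000, 1), origin = "1970-01-01", tz = "UTC"),
  temp = runif(2000, 10, 100))

# Interval join: each time point becomes the zero-length interval [time, dummy].
dt2[, dummy := time]
setkey(dt1, start, end)
ans <- foverlaps(dt2, dt1, by.x = c("time", "dummy"), nomatch = 0L)[, dummy := NULL]

# 10 intervals, each covering 10 whole seconds with both bounds included,
# so 100 matching rows are expected.
nrow(ans)

# Every retained measurement should lie within its matched interval.
ans[, all(time >= start & time <= end)]
```

Note that dt2[, dummy := time] modifies dt2 by reference, so the helper column stays in dt2 after the join; dt2[, dummy := NULL] removes it again if it is not wanted.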
See ?foverlaps for more info and this post for a performance comparison.
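For readers on a newer data.table (1.9.8 or later), the same range condition can also be written as a non-equi join in the on= argument, which avoids the dummy column. This is a different technique from the foverlaps() answer, shown here only as a minimal sketch; the tables periods and readings and the column label are hypothetical stand-ins for whatWasMeasured and measurments.

```r
library(data.table)

# Hypothetical toy tables (names and values are illustrative only).
periods <- data.table(
  start = as.POSIXct(c(1, 101), origin = "1970-01-01", tz = "UTC"),
  end   = as.POSIXct(c(10, 110), origin = "1970-01-01", tz = "UTC"),
  label = c("a", "b"))
readings <- data.table(
  time = as.POSIXct(1:200, origin = "1970-01-01", tz = "UTC"),
  temp = seq(10, 100, length.out = 200))

# Non-equi join: keep readings whose time lies in [start, end].
# The x. and i. prefixes pick the original columns from readings and
# periods, since the join columns are otherwise overwritten in the result.
res <- readings[periods,
                .(time = x.time, temp, start = i.start, end = i.end, label),
                on = .(time >= start, time <= end),
                nomatch = 0L]

# Two intervals of 10 whole seconds each give 20 rows.
nrow(res)
```

Selecting the columns explicitly in j is a deliberate choice here: without it, the join columns in the result take their values from periods, which is easy to misread.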