How to perform a join over date ranges using data.table?


Problem description

How can I do the following (which is straightforward with sqldf) using data.table, and get exactly the same result?

library(data.table)

whatWasMeasured <- data.table(start=as.POSIXct(seq(1, 1000, 100),
    origin="1970-01-01 00:00:00"),
    end=as.POSIXct(seq(10, 1000, 100), origin="1970-01-01 00:00:00"),
    x=1:10,
    y=letters[1:10])

measurments <- data.table(time=as.POSIXct(seq(1, 2000, 1),
    origin="1970-01-01 00:00:00"),
    temp=runif(2000, 10, 100))

## Alternative short names for data.tables
dt1 <- whatWasMeasured
dt2 <- measurments

## Straightforward with sqldf    
library(sqldf)

sqldf("select * from measurments m, whatWasMeasured wwm
where m.time between wwm.start and wwm.end")

Solution

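You can use the foverlaps() function, which implements joins over intervals efficiently. It expects an interval (a start column and an end column) on both sides, so in your case we just need a dummy column in measurments that repeats time, turning each measurement into a zero-length interval.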

Note 1: You should install the development version of data.table (v1.9.5), as a bug in foverlaps() has been fixed there. You can find the installation instructions here.

Note 2: I'll call whatWasMeasured = dt1 and measurments = dt2 here for convenience.

require(data.table) ## 1.9.5+
dt2[, dummy := time]

setkey(dt1, start, end)
ans = foverlaps(dt2, dt1, by.x=c("time", "dummy"), nomatch=0L)[, dummy := NULL]
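As a sanity check, the sketch below rebuilds the question's tables and confirms that the foverlaps() result matches what the SQL BETWEEN condition implies. The set.seed() call and the origin helper variable are assumptions added here for reproducibility; they are not part of the original code.

```r
library(data.table)

# Rebuild the question's tables. The seed and the `origin` variable are
# assumptions added for this sketch only.
set.seed(42)
origin <- "1970-01-01 00:00:00"
dt1 <- data.table(start = as.POSIXct(seq(1, 1000, 100), origin = origin),
                  end   = as.POSIXct(seq(10, 1000, 100), origin = origin),
                  x = 1:10,
                  y = letters[1:10])
dt2 <- data.table(time = as.POSIXct(seq(1, 2000, 1), origin = origin),
                  temp = runif(2000, 10, 100))

# Same steps as in the answer: zero-length interval [time, dummy] on dt2,
# keyed interval [start, end] on dt1.
dt2[, dummy := time]
setkey(dt1, start, end)
ans <- foverlaps(dt2, dt1, by.x = c("time", "dummy"), nomatch = 0L)[, dummy := NULL]

# Every matched measurement lies inside its interval, both ends inclusive.
stopifnot(ans[, all(time >= start & time <= end)])
# 10 intervals, each covering 10 one-second measurements.
stopifnot(nrow(ans) == 100L)
```

Note that dummy is removed only from ans; dt2 was modified by reference and still carries the column, so run dt2[, dummy := NULL] afterwards if you do not want to keep it.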

See ?foverlaps for more info and this post for a performance comparison.
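An alternative that is not part of the answer above: assuming data.table 1.9.8 or later, the same range condition can be written as a non-equi join in on=, with no dummy column. The following is a minimal sketch under that assumption. In such a join the x-side join column takes its values from i, so the original measurement time is read back explicitly through the x. prefix; the name ans2 and the seed are introduced here for illustration only.

```r
library(data.table)

# Rebuild the question's tables (seed and `origin` are assumptions of this sketch).
set.seed(42)
origin <- "1970-01-01 00:00:00"
dt1 <- data.table(start = as.POSIXct(seq(1, 1000, 100), origin = origin),
                  end   = as.POSIXct(seq(10, 1000, 100), origin = origin),
                  x = 1:10,
                  y = letters[1:10])
dt2 <- data.table(time = as.POSIXct(seq(1, 2000, 1), origin = origin),
                  temp = runif(2000, 10, 100))

# Non-equi join: for each dt1 row, keep the dt2 rows with start <= time <= end.
# x.time is dt2's original time; the i. prefix refers to dt1's columns.
ans2 <- dt2[dt1,
            .(time = x.time, temp,
              start = i.start, end = i.end, x = i.x, y = i.y),
            on = .(time >= start, time <= end),
            nomatch = 0L]

# Same checks as for the foverlaps() result.
stopifnot(nrow(ans2) == 100L)
stopifnot(ans2[, all(time >= start & time <= end)])
```

Which form is faster depends on the data, so it is worth timing both on your own tables.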
