How to combine R dataframes based on constraints on a time column
Question
I have two R tables, each with a list of users and a timestamp corresponding to the time that they took a certain action.
The first of these two tables (df1) has an exhaustive list of the users, and users will have multiple rows with different timestamps.
The second (df2) will have a more limited list of users, but again users will be in the table multiple times with different timestamps.
What I'd like to be able to do is join the two tables and end up with a table that matches each user row in df1 with the closest timestamp in df2, as long as the timestamp in df2 happened after the one in df1.
For example, if I had two tables like:
df1 <- data.frame(c(1,1,2,3), as.POSIXct(c('2016-12-01 08:53:20', '2016-12-01 12:45:47', '2016-12-01 15:34:54', '2016-12-01 00:49:50')))
names(df1) <- c('user', 'time')
df2 <- data.frame(c(1,1,3), as.POSIXct(c('2016-12-01 07:11:01', '2016-12-01 11:50:11', '2016-12-01 01:19:10')))
names(df2) <- c('user', 'time')
We have:
> df1
user time
1 1 2016-12-01 08:53:20
2 1 2016-12-01 12:45:47
3 2 2016-12-01 15:34:54
4 3 2016-12-01 00:49:50
> df2
user time
1 1 2016-12-01 07:11:01
2 1 2016-12-01 11:50:11
3 3 2016-12-01 01:19:10
The output I'd hope to get to would look like:
user time_1 time_2
1 2016-12-01 08:53:20 2016-12-01 11:50:11
1 2016-12-01 12:45:47 NA
2 2016-12-01 15:34:54 NA
3 2016-12-01 00:49:50 2016-12-01 01:19:10
I'm pretty stuck on how to get this join to work. As an additional layer of complexity, I'd love it if there were a parameter that controlled the time window to allow a match (i.e. only join the row from df2 if it is within X minutes of df1), but really this is secondary to the main problem.
Recommended answer
Part 1 - Original question
The first part of the question can be solved with the sqldf package.
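library(sqldf)
# The left join keeps every df1 row; a df2 row is attached when it belongs
# to the same user and its time is later than the df1 time
df3 <- sqldf("SELECT * FROM df1 a
             LEFT JOIN df2 b ON a.time < b.time
             AND a.user = b.user")[, c(1:2, 4)]
# Rename to match the OP's post
names(df3) <- c("user", "time_1", "time_2")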
> df3
user time_1 time_2
1 1 2016-12-01 08:53:20 2016-12-01 11:50:11
2 1 2016-12-01 12:45:47 <NA>
3 2 2016-12-01 15:34:54 <NA>
4 3 2016-12-01 00:49:50 2016-12-01 01:19:10
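Note that a join on a.time < b.time attaches every later df2 row for the same user, so a df1 row with several later timestamps in df2 would appear several times. The following is a minimal base R sketch, not part of the original answer, that keeps only the closest later timestamp per df1 row. The helper name closest_after is hypothetical, and the sample data is rebuilt with tz = "UTC" as an assumption so the example is self-contained.

```r
# Sample data from the question (tz = "UTC" is an assumption for reproducibility)
df1 <- data.frame(
  user = c(1, 1, 2, 3),
  time = as.POSIXct(c("2016-12-01 08:53:20", "2016-12-01 12:45:47",
                      "2016-12-01 15:34:54", "2016-12-01 00:49:50"), tz = "UTC")
)
df2 <- data.frame(
  user = c(1, 1, 3),
  time = as.POSIXct(c("2016-12-01 07:11:01", "2016-12-01 11:50:11",
                      "2016-12-01 01:19:10"), tz = "UTC")
)

# Hypothetical helper: for each df1 row, find the earliest df2 time for the
# same user that is strictly later than the df1 time (NA when none exists)
closest_after <- function(df1, df2) {
  time_2 <- rep(as.POSIXct(NA, tz = "UTC"), nrow(df1))
  for (i in seq_len(nrow(df1))) {
    later <- df2$time[df2$user == df1$user[i] & df2$time > df1$time[i]]
    if (length(later) > 0) {
      time_2[i] <- min(later)
    }
  }
  data.frame(user = df1$user, time_1 = df1$time, time_2 = time_2)
}

res <- closest_after(df1, df2)
print(res)
```

The loop is quadratic in the worst case, which is fine for small tables; for large tables a rolling or non-equi join from a dedicated package would likely be a better fit.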
Part 2 - Time window
If you would like to allow a time window for the match, you can subtract seconds within the SQL statement as follows:
df3 <- sqldf("SELECT * FROM df1 a
LEFT JOIN df2 b ON a.time < (b.time - 10000)
AND a.user = b.user")[,c(1:2, 4)]
> df3
user time time.1
1 1 2016-12-01 08:53:20 2016-12-01 11:50:11
2 1 2016-12-01 12:45:47 <NA>
3 2 2016-12-01 15:34:54 <NA>
4 3 2016-12-01 00:49:50 <NA>
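Note that whatever you subtract from b.time is in seconds. As written, the condition a.time < (b.time - 10000) only matches a df2 row that is more than 10000 seconds later than the df1 row, which is why user 3 (whose two timestamps are about 29 minutes apart) no longer has a match.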
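The question asks for the opposite kind of limit: a df2 row should match only if it falls within X minutes after the df1 row. A minimal base R sketch of that upper bound follows; it is an illustration rather than the answer's method. The helper name closest_within and its max_minutes parameter are hypothetical, and the sample data is rebuilt with tz = "UTC" as an assumption.

```r
# Sample data from the question (tz = "UTC" is an assumption for reproducibility)
df1 <- data.frame(
  user = c(1, 1, 2, 3),
  time = as.POSIXct(c("2016-12-01 08:53:20", "2016-12-01 12:45:47",
                      "2016-12-01 15:34:54", "2016-12-01 00:49:50"), tz = "UTC")
)
df2 <- data.frame(
  user = c(1, 1, 3),
  time = as.POSIXct(c("2016-12-01 07:11:01", "2016-12-01 11:50:11",
                      "2016-12-01 01:19:10"), tz = "UTC")
)

# Hypothetical helper: closest later df2 time for the same user, but only if
# it is no more than max_minutes after the df1 time (Inf means no limit)
closest_within <- function(df1, df2, max_minutes = Inf) {
  time_2 <- rep(as.POSIXct(NA, tz = "UTC"), nrow(df1))
  for (i in seq_len(nrow(df1))) {
    # Gap in seconds from the df1 time to every df2 time
    gap_secs <- as.numeric(difftime(df2$time, df1$time[i], units = "secs"))
    ok <- df2$user == df1$user[i] & gap_secs > 0 & gap_secs <= max_minutes * 60
    if (any(ok)) {
      time_2[i] <- min(df2$time[ok])
    }
  }
  data.frame(user = df1$user, time_1 = df1$time, time_2 = time_2)
}

res_60  <- closest_within(df1, df2, max_minutes = 60)   # one-hour window
res_180 <- closest_within(df1, df2, max_minutes = 180)  # three-hour window
print(res_60)
print(res_180)
```

With a 60-minute window only user 3 keeps a match, because user 1's later timestamp is almost three hours after the first df1 row; widening the window to 180 minutes restores that match.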