How to combine R dataframes based on constraints on a time column


Question

I have two R tables, each with a list of users and a timestamp corresponding to the time that they took a certain action.

The first of these two tables (df1) has an exhaustive list of the users, and each user may have multiple rows with different timestamps.

The second (df2) will have a more limited list of users, but again users will be in the table multiple times with different timestamps.

What I'd like to be able to do is join the two tables and end up with a table that matches each row in df1 with the closest timestamp in df2 for the same user, as long as the timestamp in df2 happened after the one in df1.

For example, if I had two tables like:

df1 <- data.frame(c(1,1,2,3), as.POSIXct(c('2016-12-01 08:53:20', '2016-12-01 12:45:47', '2016-12-01 15:34:54', '2016-12-01 00:49:50')))
names(df1) <- c('user', 'time')

df2 <- data.frame(c(1,1,3), as.POSIXct(c('2016-12-01 07:11:01', '2016-12-01 11:50:11', '2016-12-01 01:19:10')))
names(df2) <- c('user', 'time')

This gives:

> df1
  user                time
1    1 2016-12-01 08:53:20
2    1 2016-12-01 12:45:47
3    2 2016-12-01 15:34:54
4    3 2016-12-01 00:49:50

> df2
  user                time
1    1 2016-12-01 07:11:01
2    1 2016-12-01 11:50:11
3    3 2016-12-01 01:19:10

The output I'd hope to get to would look like:

user              time_1                 time_2
 1   2016-12-01 08:53:20    2016-12-01 11:50:11
 1   2016-12-01 12:45:47    NA
 2   2016-12-01 15:34:54    NA
 3   2016-12-01 00:49:50    2016-12-01 01:19:10

I'm pretty stuck on how to get this join to work. As an additional layer of complexity, I'd love it if there were a parameter that controlled the time window allowed for a match (i.e. only join the row from df2 if it is within X minutes of df1), but this is secondary to the main problem.

Recommended answer

Part 1 - Original question

The first part of the question can be solved with the sqldf package.

library(sqldf)
df3 <- sqldf("SELECT * FROM df1 a 
             LEFT JOIN df2 b ON a.time < b.time 
             AND a.user = b.user")[,c(1:2, 4)]

#rename to match OP post
names(df3) <- c("user", "time_1", "time_2")

> df3
  user              time_1              time_2
1    1 2016-12-01 08:53:20 2016-12-01 11:50:11
2    1 2016-12-01 12:45:47                <NA>
3    2 2016-12-01 15:34:54                <NA>
4    3 2016-12-01 00:49:50 2016-12-01 01:19:10
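
The join above returns one row for every df2 row that satisfies the condition, so a user with several later timestamps in df2 would appear several times. The sample data never has more than one qualifying row per df1 row, which is why the output matches the desired table. A minimal base R sketch that keeps only the closest later timestamp might look like the following. The helper name closest_later() and its tz argument are hypothetical, introduced here for illustration, and the time zone is assumed to be UTC; neither is part of sqldf or of the original answer.

```r
# Sample data from the question. The time zone is fixed to UTC here
# (an assumption) so that the result does not depend on the local setting.
df1 <- data.frame(
  user = c(1, 1, 2, 3),
  time = as.POSIXct(c("2016-12-01 08:53:20", "2016-12-01 12:45:47",
                      "2016-12-01 15:34:54", "2016-12-01 00:49:50"), tz = "UTC")
)
df2 <- data.frame(
  user = c(1, 1, 3),
  time = as.POSIXct(c("2016-12-01 07:11:01", "2016-12-01 11:50:11",
                      "2016-12-01 01:19:10"), tz = "UTC")
)

# closest_later() is a hypothetical helper, not a function from any package.
# For each row of df1 it looks up the earliest df2 timestamp of the same user
# that is strictly later than the df1 timestamp, or NA if there is none.
closest_later <- function(df1, df2, tz = "UTC") {
  time_2 <- vapply(seq_len(nrow(df1)), function(i) {
    later <- df2$time[df2$user == df1$user[i] & df2$time > df1$time[i]]
    if (length(later) == 0) NA_real_ else as.numeric(min(later))
  }, numeric(1))
  data.frame(
    user   = df1$user,
    time_1 = df1$time,
    # Convert the numeric seconds back to POSIXct
    time_2 = as.POSIXct(time_2, origin = "1970-01-01", tz = tz)
  )
}

res <- closest_later(df1, df2)
print(res)
```

This loops over the rows of df1, which is simple to read but may be slow on large tables; a rolling join (for example with data.table) would be the usual choice there.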



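Part 2 - Time window

If you want to constrain the time gap allowed for a match, you can subtract seconds inside the SQL statement, as follows. Note that this particular condition keeps a df2 row only when it is more than 10000 seconds later than the df1 row, so it sets a minimum gap rather than a maximum one: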

df3 <- sqldf("SELECT * FROM df1 a 
             LEFT JOIN df2 b ON a.time < (b.time - 10000)
             AND a.user = b.user")[,c(1:2, 4)]
> df3
  user                time              time.1
1    1 2016-12-01 08:53:20 2016-12-01 11:50:11
2    1 2016-12-01 12:45:47                <NA>
3    2 2016-12-01 15:34:54                <NA>
4    3 2016-12-01 00:49:50                <NA>

Note that whatever you subtract from b.time is in seconds.
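
The question asks for a match only when the df2 timestamp falls within X minutes after the df1 timestamp, which is an upper bound on the gap. A minimal base R sketch of that variant is shown here. The helper name closest_later_within() and its max_minutes and tz arguments are hypothetical, introduced for illustration, and the time zone is assumed to be UTC.

```r
# Sample data from the question, with the time zone fixed to UTC (an assumption).
df1 <- data.frame(
  user = c(1, 1, 2, 3),
  time = as.POSIXct(c("2016-12-01 08:53:20", "2016-12-01 12:45:47",
                      "2016-12-01 15:34:54", "2016-12-01 00:49:50"), tz = "UTC")
)
df2 <- data.frame(
  user = c(1, 1, 3),
  time = as.POSIXct(c("2016-12-01 07:11:01", "2016-12-01 11:50:11",
                      "2016-12-01 01:19:10"), tz = "UTC")
)

# closest_later_within() is a hypothetical helper, not a package function.
# max_minutes is the largest allowed gap between the df1 timestamp and the
# later df2 timestamp; the default Inf means no upper limit.
closest_later_within <- function(df1, df2, max_minutes = Inf, tz = "UTC") {
  time_2 <- vapply(seq_len(nrow(df1)), function(i) {
    # Gap in seconds from this df1 row to every df2 row
    gap_secs <- as.numeric(df2$time) - as.numeric(df1$time[i])
    ok <- df2$user == df1$user[i] & gap_secs > 0 & gap_secs <= max_minutes * 60
    if (any(ok)) min(as.numeric(df2$time[ok])) else NA_real_
  }, numeric(1))
  data.frame(
    user   = df1$user,
    time_1 = df1$time,
    time_2 = as.POSIXct(time_2, origin = "1970-01-01", tz = tz)
  )
}

# Only accept df2 rows at most 30 minutes after the df1 row
res30 <- closest_later_within(df1, df2, max_minutes = 30)
print(res30)
```

With a 30-minute window, only user 3 keeps a match in the sample data, because the gap for user 1 (08:53:20 to 11:50:11) is almost three hours.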
