在两个时间戳之间左加入R [英] Left joining in R between two timestamps
问题描述
我的目标是在间隔
上执行左联接,其中 bike_id
匹配,而记录
中的> created_at 时间戳在 start
和 end <之间/ code>在
间隔
表中
My goal is to perform a left join on intervals
where the bike_id
matches and the created_at
timestamp in records
is BETWEEN start
and end
in the intervals
table
> class(records)
[1] "data.table" "data.frame"
> class(intervals)
[1] "data.table" "data.frame"
> records
bike_id created_at resolved_at
1 28780 2019-05-03 08:29:18 2019-05-03 08:35:37
2 28780 2019-05-03 21:05:28 2019-05-03 21:07:28
3 28780 2019-05-04 21:13:39 2019-05-04 21:15:40
4 28780 2019-05-07 17:24:20 2019-05-07 17:26:39
5 28780 2019-05-08 11:34:32 2019-05-08 12:16:44
6 28780 2019-05-08 23:38:39 2019-05-08 23:40:36
> intervals
bike_id start end id
1: 28780 2019-05-03 04:44:45 2019-05-03 16:58:56 1
2: 28780 2019-05-04 07:07:39 2019-05-04 14:48:29 2
3: 28780 2019-05-07 23:28:32 2019-05-08 12:56:24 3
4: 28780 2019-05-10 06:06:21 2019-05-10 13:12:08 4
5: 28780 2019-05-12 05:21:24 2019-05-12 11:35:52 5
6: 28780 2019-05-13 08:44:54 2019-05-13 12:28:31 6
在这种情况下,输出看起来像
In this case, the output would look like
> output
bike_id created_at resolved_at id
1 28780 2019-05-03 08:29:18 2019-05-03 08:35:37 1
2 28780 2019-05-03 21:05:28 2019-05-03 21:07:28 NULL
3 28780 2019-05-04 21:13:39 2019-05-04 21:15:40 NULL
4 28780 2019-05-07 17:24:20 2019-05-07 17:26:39 NULL
5 28780 2019-05-08 11:34:32 2019-05-08 12:16:44 NULL
6 28780 2019-05-08 23:38:39 2019-05-08 23:40:36 NULL
我已尝试使用解决方案使用 tidyverse
张贴在此处,但这会导致R用尽内存(尽管两个表中的记录量仅约10万)
I have tried using the solution posted here using tidyverse
but this causes R to run out of memory (although the amount of record in both tables are only about 100K)
fuzzy_left_join(
records, intervals,
by = c(
"bike_id" = "bike_id",
"created_at" = "start",
"created_at" = "end"
),
match_fun = list(`==`, `>=`, `<=`)
) %>%
select(id, bike_id = bike_id.x, created_at, start, end)
这将引发错误:错误:向量内存已耗尽(已达到限制?)
在 data.table
或什至在R中使用 merge()
?通过id联接两个数据帧以及联接表中其他两个之间的时间戳的好方法是什么?
Is there an alternative method with rolling join in data.table
or even in base R using merge()
? What is a good method to join two dataframes by id and where a timestamp is between two others n the join table?
以下是数据
dput(intervals)
structure(list(bike_id = c(28780L, 28780L, 28780L, 28780L, 28780L,
28780L), start = structure(c(1556858685, 1556953659, 1557271712,
1557468381, 1557638484, 1557737094), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), end = structure(c(1556902736, 1556981309,
1557320184, 1557493928, 1557660952, 1557750511), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), id = c(1, 2, 3, 4, 5, 6)), row.names = c(NA,
-6L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x1030056e0>)
dput(records)
structure(list(bike_id = c(28780L, 28780L, 28780L, 28780L, 28780L,
28780L), created_at = structure(c(1556872158.796, 1556917528.845,
1557004419.928, 1557249860.939, 1557315272.396, 1557358719.333
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), resolved_at = structure(c(1556872537.867,
1556917648.118, 1557004540.056, 1557249999.892, 1557317804.183,
1557358836.202), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA,
6L), class = "data.frame")
推荐答案
我知道OP要求 tidyverse
或 data.table
解决方案,但SQL似乎是解决此问题的完美工具:
I know OP asked for a tidyverse
or data.table
solution, but SQL seems to be the perfect tool for this:
library(sqldf)
sqldf("select a.*, b.id
from records as a
left join intervals as b
on a.bike_id = b.bike_id and
a.created_at >= b.start and
a.created_at <= b.end")
或在之间使用 :
or use between
for an alternate syntax:
sqldf("select a.*, b.id
from records as a
left join intervals as b
on a.bike_id = b.bike_id and
a.created_at between b.start and b.end")
编辑:如@G所述。 Grothendieck,我们可以在读取数据以匹配OP的时区之前设置环境的时区(使用 Sys.setenv
)。
As noted by @G. Grothendieck, we can set the timezone of the environment (with Sys.setenv
) before reading in the data to match OP's timezone.
输出:
bike_id created_at resolved_at id
1 28780 2019-05-03 08:29:18 2019-05-03 08:35:37 1
2 28780 2019-05-03 21:05:28 2019-05-03 21:07:28 NA
3 28780 2019-05-04 21:13:39 2019-05-04 21:15:40 NA
4 28780 2019-05-07 17:24:20 2019-05-07 17:26:39 NA
5 28780 2019-05-08 11:34:32 2019-05-08 12:16:44 3
6 28780 2019-05-08 23:38:39 2019-05-08 23:40:36 NA
数据:(OP的 dput
之所以起作用是因为从 data.table
)创建的指针
Data: (OP's dput
does work because of the pointer created from data.table
)
Sys.setenv(TZ = "GMT")
records <- structure(list(bike_id = c(28780L, 28780L, 28780L, 28780L, 28780L,
28780L), created_at = c("2019-05-03 08:29:18", "2019-05-03 21:05:28",
"2019-05-04 21:13:39", "2019-05-07 17:24:20", "2019-05-08 11:34:32",
"2019-05-08 23:38:39"), resolved_at = c("2019-05-03 08:35:37",
"2019-05-03 21:07:28", "2019-05-04 21:15:40", "2019-05-07 17:26:39",
"2019-05-08 12:16:44", "2019-05-08 23:40:36")), class = "data.frame", row.names = c(NA,
-6L))
intervals <- structure(list(bike_id = c(28780L, 28780L, 28780L, 28780L, 28780L,
28780L), start = c("2019-05-03 04:44:45", "2019-05-04 07:07:39",
"2019-05-07 23:28:32", "2019-05-10 06:06:21", "2019-05-12 05:21:24",
"2019-05-13 08:44:54"), end = c("2019-05-03 16:58:56", "2019-05-04 14:48:29",
"2019-05-08 12:56:24", "2019-05-10 13:12:08", "2019-05-12 11:35:52",
"2019-05-13 12:28:31"), id = c(1, 2, 3, 4, 5, 6)), class = "data.frame", row.names = c(NA,
-6L))
这篇关于在两个时间戳之间左加入R的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!