R-在持续时间/间隔内合并两个数据集 [英] R- merging two data sets within time duration/intervals
问题描述
我仍在学习R,并且尝试合并来自两个不同data.table的两个数据集并在时间间隔内进行匹配时遇到了麻烦。例如,给定table1_schedule和table2_schedule:
I am still learning R and having trouble trying to merge two data sets from two different data.table and match it within the time interval. For example given table1_schedule and table2_schedule:
table1_schedule
table1_schedule
Channel Program program_Date start_time
HBO Mov A 1/1/2018 21:00
HBO Mov B 1/1/2018 23:00
HBO Mov C 1/1/2018 23:59
NatGeo Doc A 1/1/2018 11:00
NatGeo Doc B 1/1/2018 11:30
NatGeo Doc C 1/1/2018 12:00
NatGeo Doc D 1/1/2018 14:00
table2_watch
table2_watch
Person Channel program_Date start_time end_time
Name A NatGeo 1/1/2018 11:00 12:00
Name B NatGeo 1/1/2018 12:30 14:00
Name B HBO 1/1/2018 21:30 22:00
Name B HBO 1/1/2018 22:30 23:30
目标是合并在 start_time之间运行的程序和table2_watch表的 end_time,并每次添加该人在该时间间隔内观看的节目。例如,
The goal is to merge the programs that run between the "start_time" and "end_time" of the table2_watch table and add the programs watched by the person during that time interval each time. For example,
所需的输出
The wanted output
Person Channel program_Date start_time end_time Prog1 Prog2 Prog3
Name A NatGeo 1/1/2018 11:00 12:00 Doc A Doc B Doc C
Name B NatGeo 1/1/2018 12:30 14:00 Doc C Doc D -NA-
Name B HBO 1/1/2018 21:30 22:00 Mov A -NA- -NA-
Name B HBO 1/1/2018 22:30 23:30 Mov A Mov B -NA-
有以最简单,最有效的方式做到这一点的方法,例如使用 dplyr
或任何其他最适合此类问题的R命令?并且仅在超过10分钟的时间间隔内添加观看的节目,然后添加该人观看了下一个节目。谢谢
Is there a way to do this in the simplest and most efficient way such as using dplyr
or any other R commands best for this type of problem? And add the watched programs during the time interval only if it goes beyond 10 minutes then add that the person watched the next program. Thanks
推荐答案
这是一个 data.table
解决方案,使用 foverlap
。
Here is a data.table
solution where we can make use foverlap
.
我正在用简短的注释显示每一步,希望对理解有所帮助。
I'm showing every step with a short comment, to hopefully help with understanding.
library(data.table)
# Convert date & time to POSIXct
# Note that foverlap requires a start and end date, so we create an end date
# from the next start date per channel using shift for df1
setDT(df1)[, `:=`(
time1 = as.POSIXct(paste(program_Date, start_time), format = "%d/%m/%Y %H:%M"),
time2 = as.POSIXct(paste(program_Date, shift(start_time, 1, type = "lead", fill = start_time[.N])), format = "%d/%m/%Y %H:%M")), by = Channel]
setDT(df2)[, `:=`(
start = as.POSIXct(paste(program_Date, start_time), format = "%d/%m/%Y %H:%M"),
end = as.POSIXct(paste(program_Date, end_time), format = "%d/%m/%Y %H:%M"))]
# Remove unnecessary columns in preparation for final output
df1[, `:=`(program_Date = NULL, start_time = NULL)]
df2[, `:=`(program_Date = NULL, start_time = NULL, end_time = NULL)]
# Join on channel and overlapping intervals
# Once joined, remove time1 and time2
setkey(df1, Channel, time1, time2)
dt <- foverlaps(df2, df1, by.x = c("Channel", "start", "end"), nomatch = 0L)
dt[, `:=`(time1 = NULL, time2 = NULL)]
# Spread long to wide
dt[, idx := paste0("Prog",1:.N), by = c("Channel", "Person", "start")]
dcast(dt, Channel + Person + start + end ~ idx, value.var = "Program")[order(Person, start)]
# Channel Person start end Prog1 Prog2 Prog3
#1: NatGeo Name A 2018-01-01 11:00:00 2018-01-01 12:00:00 Doc A Doc B Doc C
#2: NatGeo Name B 2018-01-01 12:30:00 2018-01-01 14:00:00 Doc C Doc D NA
#3: HBO Name B 2018-01-01 21:30:00 2018-01-01 22:00:00 Mov A NA NA
#4: HBO Name B 2018-01-01 22:30:00 2018-01-01 23:30:00 Mov A Mov B NA
样本数据
Sample data
df1 <- read.table(text =
"Channel Program program_Date start_time
HBO 'Mov A' 1/1/2018 21:00
HBO 'Mov B' 1/1/2018 23:00
HBO 'Mov C' 1/1/2018 23:59
NatGeo 'Doc A' 1/1/2018 11:00
NatGeo 'Doc B' 1/1/2018 11:30
NatGeo 'Doc C' 1/1/2018 12:00
NatGeo 'Doc D' 1/1/2018 14:00", header = T)
df2 <- read.table(text =
"Person Channel program_Date start_time end_time
'Name A' NatGeo 1/1/2018 11:00 12:00
'Name B' NatGeo 1/1/2018 12:30 14:00
'Name B' HBO 1/1/2018 21:30 22:00
'Name B' HBO 1/1/2018 22:30 23:30", header = T)
这篇关于R-在持续时间/间隔内合并两个数据集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!