R-在持续时间/间隔内合并两个数据集 [英] R- merging two data sets within time duration/intervals

查看:70
本文介绍了R-在持续时间/间隔内合并两个数据集的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我仍在学习R,并且尝试合并来自两个不同data.table的两个数据集并在时间间隔内进行匹配时遇到了麻烦。例如,给定table1_schedule和table2_schedule:

I am still learning R and having trouble trying to merge two data sets from two different data.table and match it within the time interval. For example given table1_schedule and table2_schedule:


table1_schedule

table1_schedule



Channel    Program      program_Date    start_time
HBO        Mov A        1/1/2018        21:00
HBO        Mov B        1/1/2018        23:00
HBO        Mov C        1/1/2018        23:59
NatGeo     Doc A        1/1/2018        11:00
NatGeo     Doc B        1/1/2018        11:30
NatGeo     Doc C        1/1/2018        12:00
NatGeo     Doc D        1/1/2018        14:00




table2_watch

table2_watch



Person    Channel        program_Date       start_time    end_time
Name A    NatGeo             1/1/2018        11:00        12:00
Name B    NatGeo             1/1/2018        12:30        14:00         
Name B    HBO                1/1/2018        21:30        22:00
Name B    HBO                1/1/2018        22:30        23:30

目标是合并在 start_time之间运行的程序和table2_watch表的 end_time,并每次添加该人在该时间间隔内观看的节目。例如,

The goal is to merge the programs that run between the "start_time" and "end_time" of the table2_watch table and add the programs watched by the person during that time interval each time. For example,


所需的输出

The wanted output



  Person    Channel   program_Date  start_time  end_time  Prog1  Prog2  Prog3
Name A    NatGeo      1/1/2018      11:00       12:00     Doc A  Doc B  Doc C       
Name B    NatGeo      1/1/2018      12:30       14:00     Doc C  Doc D  -NA- 
Name B    HBO         1/1/2018      21:30       22:00     Mov A  -NA-   -NA- 
Name B    HBO         1/1/2018      22:30       23:30     Mov A  Mov B  -NA-

有以最简单,最有效的方式做到这一点的方法,例如使用 dplyr 或任何其他最适合此类问题的R命令?并且仅在超过10分钟的时间间隔内添加观看的节目,然后添加该人观看了下一个节目。谢谢

Is there a way to do this in the simplest and most efficient way such as using dplyr or any other R commands best for this type of problem? And add the watched programs during the time interval only if it goes beyond 10 minutes then add that the person watched the next program. Thanks

推荐答案

这是一个 data.table 解决方案,使用 foverlap

Here is a data.table solution where we can make use foverlap.

我正在用简短的注释显示每一步,希望对理解有所帮助。

I'm showing every step with a short comment, to hopefully help with understanding.

library(data.table)

# Convert date & time to POSIXct
# Note that foverlap requires a start and end date, so we create an end date
# from the next start date per channel using shift for df1
setDT(df1)[, `:=`(
    time1 = as.POSIXct(paste(program_Date, start_time), format = "%d/%m/%Y %H:%M"),
    time2 = as.POSIXct(paste(program_Date, shift(start_time, 1, type = "lead", fill = start_time[.N])), format = "%d/%m/%Y %H:%M")), by = Channel]
setDT(df2)[, `:=`(
    start = as.POSIXct(paste(program_Date, start_time), format = "%d/%m/%Y %H:%M"),
    end = as.POSIXct(paste(program_Date, end_time), format = "%d/%m/%Y %H:%M"))]

# Remove unnecessary columns in preparation for final output
df1[, `:=`(program_Date = NULL, start_time = NULL)]
df2[, `:=`(program_Date = NULL, start_time = NULL, end_time = NULL)]

# Join on channel and overlapping intervals
# Once joined, remove time1 and time2
setkey(df1, Channel, time1, time2)
dt <- foverlaps(df2, df1, by.x = c("Channel", "start", "end"), nomatch = 0L)
dt[, `:=`(time1 = NULL, time2 = NULL)]

# Spread long to wide
dt[, idx := paste0("Prog",1:.N), by = c("Channel", "Person", "start")]
dcast(dt, Channel + Person + start + end ~ idx, value.var = "Program")[order(Person, start)]
#   Channel Person               start                 end Prog1 Prog2 Prog3
#1:  NatGeo Name A 2018-01-01 11:00:00 2018-01-01 12:00:00 Doc A Doc B Doc C
#2:  NatGeo Name B 2018-01-01 12:30:00 2018-01-01 14:00:00 Doc C Doc D    NA
#3:     HBO Name B 2018-01-01 21:30:00 2018-01-01 22:00:00 Mov A    NA    NA
#4:     HBO Name B 2018-01-01 22:30:00 2018-01-01 23:30:00 Mov A Mov B    NA






样本数据




Sample data

df1 <- read.table(text =
    "Channel    Program      program_Date    start_time
HBO        'Mov A'        1/1/2018        21:00
HBO        'Mov B'        1/1/2018        23:00
HBO        'Mov C'        1/1/2018        23:59
NatGeo     'Doc A'        1/1/2018        11:00
NatGeo     'Doc B'        1/1/2018        11:30
NatGeo     'Doc C'        1/1/2018        12:00
NatGeo     'Doc D'        1/1/2018        14:00", header = T)


df2 <- read.table(text =
    "Person    Channel        program_Date       start_time    end_time
'Name A'    NatGeo             1/1/2018        11:00        12:00
'Name B'    NatGeo             1/1/2018        12:30        14:00
'Name B'    HBO                1/1/2018        21:30        22:00
'Name B'    HBO                1/1/2018        22:30        23:30", header = T)

这篇关于R-在持续时间/间隔内合并两个数据集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆