Flag rows with interval overlap in R


Problem description

I have a data frame df containing TV viewing data, and I would like to run a QC check for overlapping viewing. For the same day, the same household, and each individual, each minute should be credited to only one station or channel.

For example, I would like to flag rows 8 and 9, because it seems impossible that an individual in a single household watched two TV stations (62 and 67) at the same time (see start_hour_minute). Is there a way to flag these rows? Ideally something like a minute-by-minute view by individual and by day.

df <- data.frame(stringsAsFactors=FALSE,
         date = c("2018-09-02", "2018-09-02", "2018-09-02", "2018-09-02",
                  "2018-09-02", "2018-09-02", "2018-09-02", "2018-09-02",
                  "2018-09-02"),
         householdID = c(18101276L, 18101276L, 18102843L, 18102843L, 18102843L,
                  18102843L, 18104148L, 18104148L, 18104148L),
   Station_id = c(74L, 74L, 62L, 74L, 74L, 74L, 62L, 62L, 67L),
        IndID = c("aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa"),
        Start = c(111300L, 143400L, 030000L, 034900L, 064400L, 070500L, 060400L,
                  075100L, 075100L),
          End = c(111459L, 143759L, 033059L, 035359L, 064759L, 070559L, 060459L,
                  81559L, 81559L),
   start_hour_minute = c(1113L, 1434L, 0300L, 0349L, 0644L, 0705L, 0604L, 0751L, 0751L),
     end_hour_minute = c(1114L, 1437L, 0330L, 0353L, 0647L, 0705L, 0604L, 0815L, 0815L))

Recommended answer

The lubridate package has an Interval class and the %within% operator, which checks whether a timestamp falls within an interval. You can use these to build the flags.
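As a small illustration of those two pieces, here is a minimal sketch using made-up timestamps (they are assumptions for the example, not rows from the question's data). It also shows int_overlaps(), another lubridate function, which compares two intervals directly instead of testing a single timestamp.

```r
library(lubridate)

# Made-up viewing session, for illustration only.
session <- interval(ymd_hms("2018-09-02 07:51:00"),
                    ymd_hms("2018-09-02 08:15:59"))

# %within% tests whether a timestamp falls inside an interval.
inside  <- ymd_hms("2018-09-02 08:00:00") %within% session
outside <- ymd_hms("2018-09-02 09:00:00") %within% session

# int_overlaps() tests whether two intervals share any time.
other  <- interval(ymd_hms("2018-09-02 08:10:00"),
                   ymd_hms("2018-09-02 08:30:00"))
shared <- int_overlaps(session, other)
```

Testing only start times with %within% can miss the earlier of two overlapping sessions, which is why a symmetric check such as int_overlaps() may be the more convenient choice for this kind of QC.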

Using the dummy data you provided above...

library(dplyr)
library(lubridate)

data_out <- df %>%
  # Get the hour, minute, and second values as standalone numerics.
  mutate(
    date = ymd(date),
    Start_Hour = floor(Start / 10000),
    Start_Minute = floor((Start - Start_Hour * 10000) / 100),
    Start_Second = (Start - Start_Hour * 10000) - Start_Minute * 100,
    End_Hour = floor(End / 10000),
    End_Minute = floor((End - End_Hour * 10000) / 100),
    End_Second = (End - End_Hour * 10000) - End_Minute * 100,
    # Use the hour, minute, second values to create start and end timestamps.
    Start_TS = as_datetime(date) + hours(Start_Hour) + minutes(Start_Minute) + seconds(Start_Second),
    End_TS = as_datetime(date) + hours(End_Hour) + minutes(End_Minute) + seconds(End_Second)
  ) %>%
  # Group by day, household, and individual, so that viewing is compared across stations.
  group_by(date, householdID, IndID) %>%
  # Flag rows whose interval overlaps another row's interval in the same group.
  # Every interval overlaps itself, so a count above 1 means a real overlap.
  # Intervals are built on the fly rather than stored as a column,
  # because dplyr doesn't always play nice with interval objects.
  mutate(
    overlap_flag = sapply(seq_along(Start_TS), function(i) {
      as.numeric(sum(int_overlaps(interval(Start_TS[i], End_TS[i]),
                                  interval(Start_TS, End_TS))) > 1)
    })
  ) %>%
  ungroup()

Use data_out %>% filter(overlap_flag == 1) to see the flagged values.

Note: The dplyr and lubridate packages don't always play nicely together, especially in older versions. You may need to update both packages.
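If updating the packages is not an option, the same overlap check can be sketched in base R with plain numeric comparisons. This is only a rough alternative, not part of the original answer: flag_overlaps is a hypothetical helper name, and the sample values are made up. It assumes times are HHMMSS integers within a single day, so that they compare in chronological order.

```r
# Hypothetical helper: flags rows whose [start, end] range overlaps
# another row's range within the same group.
flag_overlaps <- function(start, end, group) {
  flag <- integer(length(start))
  for (g in unique(group)) {
    idx <- which(group == g)
    for (i in idx) {
      # Two closed ranges overlap when each starts before the other ends.
      hits <- start[i] <= end[idx] & start[idx] <= end[i]
      # Each row matches itself, so more than one hit means a real overlap.
      flag[i] <- as.integer(sum(hits) > 1)
    }
  }
  flag
}

# Made-up example in HHMMSS form: rows 2 and 3 cover the same minutes.
start <- c(60400L, 75100L, 75100L, 111300L)
end   <- c(60459L, 81559L, 81559L, 111459L)
group <- c("hh1", "hh1", "hh1", "hh2")
flags <- flag_overlaps(start, end, group)
```

With the question's data, the group could be built as paste(df$date, df$householdID, df$IndID), with df$Start and df$End passed as the start and end values.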
