Identify gaps in time data


Problem description

I found an approach to the problem below. It works on a small dataset but still produces wrong output on large datasets. Does anyone know why? I can't find the mistake. Here's the code:

df$continuous <-
  unlist(lapply(split(df, df$ID),
                function(x) {
                  sapply(1:nrow(x),
                         function(y) {
                           any(x$start[y] - x$end[-(y:NROW(x$end))] <= 1)
                         })
                }))
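
One possible cause, offered as a guess rather than a confirmed diagnosis: split() orders its groups by the levels of df$ID, so unlist(lapply(split(...))) returns values group by group. If the rows of df are not already sorted by ID, assigning that vector back to df$continuous puts values on the wrong rows. The toy data below (df2, val, and the multiply-by-10 step are made up for illustration) shows the effect, and how base R's unsplit() restores the original row order.

```r
# Toy data whose rows are NOT sorted by ID (hypothetical example)
df2 <- data.frame(ID = c("2", "1", "2", "1"), val = 1:4)

# Per-group work, then unlist(): values come back grouped by ID level
by_group <- unlist(
  lapply(split(df2$val, df2$ID), function(v) v * 10),
  use.names = FALSE
)

# Same per-group work, but unsplit() puts each value back on its own row
restored <- unsplit(
  lapply(split(df2$val, df2$ID), function(v) v * 10),
  df2$ID
)

print(by_group)  # grouped order: rows of ID "1" first, then ID "2"
print(restored)  # original row order
```

If this is what happens on the large dataset, sorting by ID (and by start within each ID) before the computation, or using unsplit(), should keep results aligned with their rows.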

ORIGINAL PROBLEM: I'm working on a function to identify gaps in a series of start/end dates. The output should be FALSE if a start date falls more than 1 day after every previous end date, that is, if no earlier interval reaches it.

Data:

df <- data.frame('ID' = c('1','1','1','1','1','1'), 'start' = as.Date(c('2010-01-01', '2010-01-03', '2010-01-05', '2010-01-09','2010-02-01', '2010-02-10')),
                 'end' = as.Date(c('2010-01-03', '2010-01-22', '2010-01-07', '2010-01-12', '2010-02-10', '2010-02-12')))

This is my attempt to solve this with x = start and y = end:

my_fun <- function(x,y){
  any(x[i] - y[1:NROW(i)-1] <= 1)
}

It works well if I specify i, but I haven't managed to wrap it in a loop. Ultimately, this function should be applied per group to a large dataset, in a dplyr manner.

The result should look like this:

  ID      start        end  continuous
1  1 2010-01-01 2010-01-03 FALSE #or TRUE
2  1 2010-01-03 2010-01-22 TRUE
3  1 2010-01-05 2010-01-07 TRUE
4  1 2010-01-09 2010-01-12 TRUE
5  1 2010-02-01 2010-02-10 FALSE
6  1 2010-02-10 2010-02-12 TRUE #according to my function or FALSE compared to start[1] would be even better

Thanks a lot for your help.

Recommended answer

You can do this using dplyr and lubridate. dplyr has really useful window functions like lag() that are handy for this type of analysis.

library(tidyverse)
library(lubridate)

df %>% 
  mutate(start - lag(end, 1) == 0)

# ID      start        end start - lag(end, 1) == 0
# 1  1 2010-01-01 2010-01-03                       NA
# 2  1 2010-01-03 2010-01-22                     TRUE
# 3  1 2010-01-05 2010-01-07                    FALSE
# 4  1 2010-01-09 2010-01-12                    FALSE
# 5  1 2010-02-01 2010-02-10                    FALSE
# 6  1 2010-02-10 2010-02-12                     TRUE

How do you want to handle the first row of your data? Since there is no previous value, it shows NA. That is generally the right way to handle this situation, but I can edit my answer if you'd like a different value there.
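
The lag() comparison above only looks at the immediately preceding row, so it marks rows 3 and 4 as FALSE even though the interval ending 2010-01-22 still covers them. A minimal sketch of one way to compare each start against every earlier end date is to track the running maximum of end within each ID. This assumes rows should be ordered by start within each ID and that the first row of a group should be FALSE; the column name prev_max_end is made up for illustration. Dates are converted with as.numeric() because cummax() is not defined for Date objects.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  ID    = c("1", "1", "1", "1", "1", "1"),
  start = as.Date(c("2010-01-01", "2010-01-03", "2010-01-05",
                    "2010-01-09", "2010-02-01", "2010-02-10")),
  end   = as.Date(c("2010-01-03", "2010-01-22", "2010-01-07",
                    "2010-01-12", "2010-02-10", "2010-02-12"))
)

result <- df %>%
  arrange(ID, start) %>%          # assumption: order by start within ID
  group_by(ID) %>%
  mutate(
    # latest end date among all earlier rows of the group (days as numeric)
    prev_max_end = lag(cummax(as.numeric(end))),
    # TRUE when start is at most 1 day after some earlier end;
    # the first row has no earlier end (NA), set to FALSE by assumption
    continuous = coalesce(as.numeric(start) - prev_max_end <= 1, FALSE)
  ) %>%
  ungroup() %>%
  select(-prev_max_end)

print(result)
```

On the sample data this gives FALSE, TRUE, TRUE, TRUE, FALSE, TRUE for continuous, which matches the table in the question. Replace the FALSE inside coalesce() with TRUE or NA if the first row of each group should be treated differently.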
