根据条件通过ID组合重叠日期 [英] Combining overlapping dates by ID based on a condition

查看:64
本文介绍了根据条件通过ID组合重叠日期的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想根据条件为每个ID选择开始日期和结束日期。

I would like to select start and end dates for each ID based on a condition.

对于每个ID,如果结束日期和开始日期之间的差是< = 14天,那么我想获取该行的结束日期并采用从上面的行开始的日期。就是合并时间间隔少于14天的时间。

For each ID, if the difference between the end date and start date is <=14 days then I would like to take the end date for that row and take the start date from the row above. Ie. combine time periods for intervals less than 14 days.

对于ID为45和28的我已经可以执行此操作,但对于ID却不超过14天的81个,则无法执行此操作。

I have been able to do this for the ID's 45 and 28, but not for 81, which has several dates which are less than 14 days.

我已附加我的数据以及最终希望得到的数据。

I have attached my data and what I hope to end up with.

ID  STARTDATE   ENDDATE     Difference  
45  2004-09-04  2004-10-09  NA
45  2004-11-04  2004-12-08  26      
28  2013-07-25  2013-08-28  NA      
28  2013-08-27  2017-04-06  -1
81  2013-02-22  2013-03-28  NA
81  2013-03-25  2013-04-26  -3
81  2013-04-24  2013-05-26  -2
81  2013-05-22  2013-06-23  -4
81  2013-06-24  2013-07-26  1
81  2013-07-22  2013-08-23  -4


ID  STARTDATE   ENDDATE     Difference      startdate     enddate
45  2004-09-04  2004-10-09    NA            2004-09-04  2004-10-09
45  2004-11-04  2004-12-08    26            2004-11-04  2004-12-08
28  2013-08-27  2017-04-06    -1            2013-07-25  2017-04-06
81  2013-07-22  2013-08-23    -4            2013-02-22  2013-08-23

新数据样本

ID  START_DATE  end.date.plus   end.date    start.date  
75  18/10/11    21/11/11    1/01/70 1/01/70  
46  2/10/09     8/08/10     1/01/70 1/01/70  
45  4/09/04     9/10/04     1/01/70 1/01/70  
45  4/11/04     8/12/04     1/01/70 1/01/70  
28  25/07/13    28/08/13    1/01/70 1/01/70  
28  27/08/13    6/04/17     1/01/70 1/01/70  
81  22/02/13    28/03/13    1/01/70 1/01/70  
81  25/03/13    26/04/13    1/01/70 1/01/70  
81  24/04/13    26/05/13    1/01/70 1/01/70  
81  22/05/13    23/06/13    1/01/70 1/01/70  
81  24/06/13    26/07/13    1/01/70 1/01/70  
81  22/07/13    23/08/13    1/01/70 1/01/70  


推荐答案

我建议使用下面的功能通过ID计算分组的结构(差异列,我将其保留在数据框中,但是无关紧要) 。首先,使用您的示例;

I suggest the below function that calculates the structure of the groups by ID (the column Difference, I keep it in the data frame, however, it is irrelevant). Firstly, using your example;

data <- read.table(text=
                  "ID  STARTDATE   ENDDATE     Difference  
                   45  2004-09-04  2004-10-09  NA
                   45  2004-11-04  2004-12-08  26      
                   28  2013-07-25  2013-08-28  NA      
                   28  2013-08-27  2017-04-06  -1
                   81  2013-02-22  2013-03-28  NA
                   81  2013-03-25  2013-04-26  -3
                   81  2013-04-24  2013-05-26  -2
                   81  2013-05-22  2013-06-23  -4
                   81  2013-06-24  2013-07-26  1
                   81  2013-07-22  2013-08-23  -4", header=T)

continuum <- function(data){
  library(parsedate, quietly=T) #access to parse_date() function for automatic recognition of date format
  data[,c("STARTDATE", "ENDDATE")] <- lapply(data[,c("STARTDATE", "ENDDATE")], function(e) as.Date(parse_date(e)))
  data <- data[with(data, order(ID, STARTDATE)),]
  data$diffr <- 0
  result <- data.frame()
  for ( i in unique(data$ID)){
    temp <-data[data$ID==i,]
    if(length(temp$ID)==1){
      startdate <- temp$STARTDATE
      enddate <- temp$ENDDATE
    } else{
    for(j in 1:(length(temp$ID)-1)){
      temp$diffr[j+1] <- difftime(temp$STARTDATE[j+1], temp$ENDDATE[j])
    }
    startdate <- c(temp$STARTDATE[temp$diffr==0], temp$STARTDATE[temp$diffr>14])
      if(identical(rep(TRUE, length(temp$ID)), temp$diffr<=14)){
        enddate <- max(temp$ENDDATE)
      } else{
        enddate <- c(temp$ENDDATE[match(temp$ENDDATE[temp$diffr>14], temp$ENDDATE)-1], temp$ENDDATE[length(temp$diffr)])
      }
      } 
    result <- rbind(result, 
                    data.frame(
                      ID=rep(i, length(startdate)),
                      startdate=startdate,
                      enddate=enddate))
  }
  return(result)
}

continuum(data)
#  ID  startdate    enddate
#1 28 2013-07-25 2017-04-06
#2 45 2004-09-04 2004-10-09
#3 45 2004-11-04 2004-12-08
#4 81 2013-02-22 2013-08-23

第二,在更复杂的示例上:

Secondly, on a bit more complicated example:

data2 <- read.table(text=
                  "ID  STARTDATE   ENDDATE     Difference  
                   45  2004-09-04  2004-10-09  NA
                   45  2004-11-04  2004-12-08  26      
                   28  2013-07-25  2013-08-28  NA      
                   28  2013-08-27  2017-04-06  -1
                   81  2013-02-22  2013-03-28  NA
                   81  2013-03-25  2013-04-26  -3
                   81  2013-04-24  2013-05-26  -2
                   81  2013-05-22  2013-06-23  -4
                   81  2013-06-24  2013-07-26  1
                   81  2013-07-22  2013-08-23  -4
                   81  2014-05-01  2015-06-02  8 
                   81  2015-07-05  2015-09-06  9", header=T)
continuum(data2)
#  ID  startdate    enddate
#1 28 2013-07-25 2017-04-06
#2 45 2004-09-04 2004-10-09
#3 45 2004-11-04 2004-12-08
#4 81 2013-02-22 2013-08-23
#5 81 2014-05-01 2015-06-02
#6 81 2015-07-05 2015-09-06

EDIT:已对功能进行了调整,它会自动识别日期格式(至少您到目前为止提供的格式,不声称它会识别乱码)。现在按照您的新示例进行操作:

The function was adjusted, it automatically recognizes the date format (at least the formats you provided so far, not claiming it will recognize jibberish). Now follows your new, more elaborate example:

data3 <- read.table(text="
                    ID START_DATE end.date.plus end.date start.date
                    75 18/10/11 21/11/11 1/01/70 1/01/70
                    46 2/10/09 8/08/10 1/01/70 1/01/70
                    45 4/09/04 9/10/04 1/01/70 1/01/70
                    45 4/11/04 8/12/04 1/01/70 1/01/70
                    28 25/07/13 28/08/13 1/01/70 1/01/70
                    28 27/08/13 6/04/17 1/01/70 1/01/70
                    81 22/02/13 28/03/13 1/01/70 1/01/70
                    81 25/03/13 26/04/13 1/01/70 1/01/70
                    81 24/04/13 26/05/13 1/01/70 1/01/70
                    81 22/05/13 23/06/13 1/01/70 1/01/70
                    81 24/06/13 26/07/13 1/01/70 1/01/70
                    81 22/07/13 23/08/13 1/01/70 1/01/70", header=T)

此数据集为与之前不同一个例子,不仅涉及日期格式,这就是为什么该功能不起作用的原因。这也是一个更可靠的示例,也是一个更好的示例,因为您涵盖了两个日期的行为的更多情况,例如ID 45的实例,这种情况是新的(一个子连续体被较长的一个子连续体隐藏),没有发生在前面的例子中。这也使功能更强大!接下来,您需要为函数提供正确的变量名称,分别为 STARTDATE ENDDATE 。我认为 end.date start.date 是虚拟的,这就是为什么我转换 START_DATE STARTDATE end.date.plus ENDDATE ,因为此逻辑是在询问您的问题时设置的。

This data set is different from the previous example, not only in regards to the date format, and that is why the function did not work. It is also a more robust example, better example, as you cover more case of the behavior of the two dates, like the instance of the ID 45, the situation is new (one sub-continuum hid by the longer one), not occurring in the example before. That makes also the function more robust! What you need to do next is to provide the correct names of variables for the function, STARTDATE and ENDDATE. I figure that end.date and start.date are dummies, that is why I convert START_DATE into STARTDATE and end.date.plus to ENDDATE, as this logic was set at asking your question.

names(data3)[2] <- "STARTDATE"
names(data3)[3] <- "ENDDATE"

您可以重命名列,从上方加载函数并将其应用于数据集 data3

You can rename the columns, load the function from above and apply it on data set data3:

continuum(data3)

可打印

 #  ID  startdate    enddate
 #1 28 2013-07-25 2017-06-04
 #2 45 2004-04-09 2004-09-10
 #3 46 2009-02-10 2010-08-08
 #4 75 2011-10-18 2011-11-21
 #5 81 2013-02-22 2013-08-23

EDIT2:我为自己创建了一个复杂的日期示例,并实现了以下功能:

I created myself a complex example of dates and made the following function:

continuum <- function(data){
  data <- data[with(data, order(ID, STARTDATE)),]
  result <- data.frame()
  for ( i in unique(data$ID)){
    temp <-data[data$ID==i,]
    j <- 1
    startdate <- temp$STARTDATE[1]
    enddate <- temp$ENDDATE[1]
    if(length(temp$ID)==1){result <- rbind(result, data.frame(ID=i, STARTDATE=startdate, ENDDATE=enddate))
    } else 
      while(j < length(temp$ID)){
        if(temp$STARTDATE[j+1]-14<=temp$ENDDATE[j]){ 
          startdate <- startdate
          if(temp$ENDDATE[j+1]<=enddate){enddate <- enddate} else{enddate <- temp$ENDDATE[j+1]}
          if(j==(length(temp$ID)-1)){result <- rbind(result, data.frame(ID=i, STARTDATE=startdate, ENDDATE=enddate))}
          j <- j+1
        } else if(temp$STARTDATE[j+1]-14>enddate){
          result <- rbind(result, data.frame(ID=i, STARTDATE=startdate, ENDDATE=enddate))
          startdate <- temp$STARTDATE[j+1]
          enddate <- temp$ENDDATE[j+1]
          if(j==(length(temp$ID)-1)){result <- rbind(result, data.frame(ID=i, STARTDATE=startdate, ENDDATE=enddate))}
          j <- j+1 
        } else{
          if(temp$ENDDATE[j+1]<=enddate){enddate <- enddate} else{enddate <- temp$ENDDATE[j+1]}
          if(j==(length(temp$ID)-1)){result <- rbind(result, data.frame(ID=i, STARTDATE=startdate, ENDDATE=enddate))}
          j <- j+1}
      }
  }
  return(result)
}

请确保您使用R正确解释了日期!这样的日期

Make sure that you the dates are correctly interpreted by R! Dates like this

45 4/11/04 8/12/04 1/01/70 1/01/70 
28 25/07/13 28/08/13 1/01/70 1/01/70

日期格式不是很好,最好是使用 2017-04-23

are not in a good date format, better, go for a format like 2017-04-23,

让我知道它是否对您有用。

Let me know if it worked for you.

这篇关于根据条件通过ID组合重叠日期的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆