How do I remove rows based on a range of dates given by values in 2 columns?

Problem description

I have a data set that includes a range of dates, and I need to fill in the missing dates as new rows. df1 is an example of the data I am working with, and df2 is what I've managed to achieve so far (and where I'm stuck). df3 is where I would like to end up!

df1
ID     Date       DateStart     DateEnd
1      2/11/2021  2/11/2021     2/17/2021
1      2/19/2021  2/19/2021     2/21/2021
2      1/15/2021  1/15/2021     1/20/2021
2      1/23/2021  1/23/2021     1/24/2021

This is where I am so far. The NAs aren't an issue, because I intend to drop the DateStart and DateEnd columns once I've done what I need to do. The issue is that I don't want to include the dates that fall within a preceding DateStart-DateEnd range. To get here, I grouped by ID and filled in the missing dates between the dates in df1:

df2
ID     Date       DateStart     DateEnd
1      2/11/2021  2/11/2021     2/17/2021
1      2/12/2021  NA            NA
1      2/13/2021  NA            NA
1      2/14/2021  NA            NA
1      2/15/2021  NA            NA
1      2/16/2021  NA            NA
1      2/17/2021  NA            NA
1      2/18/2021  NA            NA
1      2/19/2021  2/19/2021     2/21/2021
2      1/15/2021  1/15/2021     1/20/2021
2      1/16/2021  NA            NA
2      1/17/2021  NA            NA
2      1/18/2021  NA            NA
2      1/19/2021  NA            NA
2      1/20/2021  NA            NA
2      1/21/2021  NA            NA
2      1/22/2021  NA            NA    
2      1/23/2021  1/23/2021     1/24/2021  
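
The code that produced df2 is not shown in the question. A minimal sketch of one way to get there, assuming dplyr, tidyr and lubridate, and taking the sample values from the Data section of the answer (the use of complete here is an assumption, not the asker's actual code):

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Sample rows, using the values from the answer's Data section.
df1 <- data.frame(
  ID        = c(1L, 1L, 2L, 2L),
  Date      = c("2/11/2021", "2/19/2021", "1/15/2021", "1/23/2021"),
  DateStart = c("2/11/2021", "2/19/2021", "1/15/2021", "1/23/2021"),
  DateEnd   = c("2/17/2021", "2/21/2021", "1/20/2021", "1/24/2021")
)

# Assumed reconstruction of the asker's step: within each ID, add one row for
# every day between the first and last observed Date. Rows created this way
# get NA in DateStart and DateEnd.
df2 <- df1 %>%
  mutate(across(-ID, mdy)) %>%
  group_by(ID) %>%
  complete(Date = seq(min(Date), max(Date), by = "1 day")) %>%
  ungroup()
```

This fills every day between the observed dates, including the days that already lie inside a DateStart-DateEnd range, which is exactly the surplus the question asks to remove.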

This is what I'd actually like to end up with:

df3
ID     Date       DateStart     DateEnd
1      2/11/2021  2/11/2021     2/17/2021
1      2/18/2021  NA            NA
1      2/19/2021  2/19/2021     2/21/2021
2      1/15/2021  1/15/2021     1/20/2021
2      1/21/2021  NA            NA
2      1/22/2021  NA            NA    
2      1/23/2021  1/23/2021     1/24/2021  

In df3 the missing dates are filled in, but the dates within a DateStart-DateEnd range are not.

Any thoughts on how to achieve this? Note: my dataset has a large number of observations.

Recommended answer

  • Convert the date columns to Date class.

  • For each ID, use complete to create a sequence of dates from the minimum DateStart to the maximum DateEnd.

  • Use fill to carry the previous non-NA DateStart and DateEnd values forward, then reset them to NA wherever Date > DateEnd.

  • For every group of ID, DateStart and DateEnd, keep the rows with NA values or row number 1 of the group.

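      library(dplyr)
      library(tidyr)
      
      df %>%
        # parse the m/d/Y strings in every column except ID into Date objects
        mutate(across(-ID, lubridate::mdy)) %>%
        group_by(ID) %>%
        # one row per calendar day from the first DateStart to the last DateEnd
        complete(Date = seq(min(DateStart), max(DateEnd), by = '1 day')) %>%
        # carry each range forward onto the newly created rows
        fill(DateStart, DateEnd) %>%
        ungroup() %>%
        # days past the end of the carried-forward range are real gaps: reset to NA
        mutate(across(c(DateStart, DateEnd), ~replace(., Date > DateEnd, NA))) %>%
        group_by(ID, DateStart, DateEnd) %>%
        # keep the gap days plus the first day of each range
        filter(is.na(DateStart) | row_number() == 1)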
      
      #     ID Date       DateStart  DateEnd   
      #  <int> <date>     <date>     <date>    
      #1     1 2021-02-11 2021-02-11 2021-02-17
      #2     1 2021-02-18 NA         NA        
      #3     1 2021-02-19 2021-02-19 2021-02-21
      #4     2 2021-01-15 2021-01-15 2021-01-20
      #5     2 2021-01-21 NA         NA        
      #6     2 2021-01-22 NA         NA        
      #7     2 2021-01-23 2021-01-23 2021-01-24
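
To see why the final filter works, it can help to stop the pipeline just before it. The sketch below assumes the sample data from the Data section; intermediate and result are illustrative names that do not appear in the original answer.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Sample data, same values as the Data section.
df <- data.frame(
  ID        = c(1L, 1L, 2L, 2L),
  Date      = c("2/11/2021", "2/19/2021", "1/15/2021", "1/23/2021"),
  DateStart = c("2/11/2021", "2/19/2021", "1/15/2021", "1/23/2021"),
  DateEnd   = c("2/17/2021", "2/21/2021", "1/20/2021", "1/24/2021")
)

# First three steps only: parse, complete per ID, fill, then blank out gaps.
intermediate <- df %>%
  mutate(across(-ID, mdy)) %>%
  group_by(ID) %>%
  complete(Date = seq(min(DateStart), max(DateEnd), by = "1 day")) %>%
  fill(DateStart, DateEnd) %>%
  ungroup() %>%
  mutate(across(c(DateStart, DateEnd), ~ replace(.x, Date > DateEnd, NA)))

# Days inside a range share that range's DateStart and DateEnd, so they form
# one group whose first row is the range start. Gap days have NA in both.
print(count(intermediate, ID, DateStart, DateEnd))

# Last step, then drop the helper columns as the asker intends to do.
result <- intermediate %>%
  group_by(ID, DateStart, DateEnd) %>%
  filter(is.na(DateStart) | row_number() == 1) %>%
  ungroup() %>%
  select(ID, Date)
```

Because every day inside a range lands in the same (ID, DateStart, DateEnd) group, keeping row number 1 retains only the range start, while the is.na(DateStart) condition retains all gap days.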
      

      Data

      df <- structure(list(ID = c(1L, 1L, 2L, 2L), Date = c("2/11/2021", 
      "2/19/2021", "1/15/2021", "1/23/2021"), DateStart = c("2/11/2021", 
      "2/19/2021", "1/15/2021", "1/23/2021"), DateEnd = c("2/17/2021", 
      "2/21/2021", "1/20/2021", "1/24/2021")), 
      class = "data.frame", row.names = c(NA, -4L))
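
Since the question only needs the final Date column, a different route is also possible. This is an alternative technique, not the answerer's method: build every calendar day per ID, build the days that lie inside a range after its start, and remove the latter with anti_join. It is a sketch that assumes ranges within an ID are given one per row as in the sample data; all_days, covered and result are illustrative names.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Sample data, same values as the Data section.
df <- data.frame(
  ID        = c(1L, 1L, 2L, 2L),
  Date      = c("2/11/2021", "2/19/2021", "1/15/2021", "1/23/2021"),
  DateStart = c("2/11/2021", "2/19/2021", "1/15/2021", "1/23/2021"),
  DateEnd   = c("2/17/2021", "2/21/2021", "1/20/2021", "1/24/2021")
)

ranges <- df %>% mutate(across(-ID, mdy))

# Every calendar day per ID, from the first DateStart to the last DateEnd.
all_days <- ranges %>%
  group_by(ID) %>%
  summarise(Date = list(seq(min(DateStart), max(DateEnd), by = "1 day")),
            .groups = "drop") %>%
  unnest(Date)

# Days covered by a range, excluding the range's own start day.
covered <- ranges %>%
  transmute(ID,
            Date = Map(function(s, e) seq(s, e, by = "1 day")[-1],
                       DateStart, DateEnd)) %>%
  unnest(Date)

# Keep the range starts and the gap days; drop everything inside a range.
result <- anti_join(all_days, covered, by = c("ID", "Date"))
```

Whether this is faster than the fill-and-filter pipeline on a large dataset is not something the source establishes, so it is worth timing both on real data before choosing.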
      
