Fill all gaps between starting and ending dates with dplyr

Question

Let's say I have a data.frame that looks like this:

user_df = read.table(text = "person_id job_number job_type start_date end_date
                  1 1 B 2012-11-01 2014-01-01
                  1 2 A 2016-02-01 2016-10-01
                  1 3 A 2016-12-01 2020-01-01
                  1 4 B 2020-01-01 2021-01-01
                  2 1 A 2011-03-01 2012-08-01
                  2 2 B 2013-01-01 2020-01-01
                  2 3 A 2020-01-01 2021-01-01
                  2 4 B 2021-01-01 2021-01-17
                  3 1 A 2005-03-01 2011-03-01
                  3 2 B 2012-01-01 2014-01-01", header = TRUE)

Each person_id has a start date and an end date for a given job. I would like to insert rows that fill the gaps between jobs, and create an additional column called unemployed that is set to 1 for those rows.

The resulting data.frame for the first several rows would look like this:

expected_df = read.table(text = "person_id job_number job_type start_date end_date unemployed
                  1 1 B 2012-11-01 2014-01-01 0
                  1 1 B 2014-01-01 2016-02-01 1
                  1 2 A 2016-02-01 2016-10-01 0
                  1 2 A 2016-10-01 2016-12-01 1
                  1 3 A 2016-12-01 2020-01-01 0
                  1 4 B 2020-01-01 2021-01-01 0
                  2 1 A 2011-03-01 2012-08-01 0
                  2 1 A 2012-08-01 2013-01-01 1
                  2 2 B 2013-01-01 2020-01-01 0", header = TRUE)

So I'm essentially inserting a new row with the previous row's end date as its start date and the next row's start date as its end date.

Not sure where to even start with this. I was able to compute the total amount of time spent unemployed by taking the number of days between the earliest start date and the last end date and subtracting from it the total time actually covered by each row. But I'm not sure how I'd go about programmatically inserting rows within a dplyr chain to fill in the unemployed time.
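
For reference, the total described here could be computed roughly as follows. This is only a sketch on the rows for person 1: it assumes dplyr 1.0 or later (for across()), assumes jobs never overlap, and the column names span_days, employed_days and unemployed_days are made up for illustration.

```r
library(dplyr)

# Rows for person 1 from the question's data
user_df <- read.table(text = "person_id job_number job_type start_date end_date
  1 1 B 2012-11-01 2014-01-01
  1 2 A 2016-02-01 2016-10-01
  1 3 A 2016-12-01 2020-01-01
  1 4 B 2020-01-01 2021-01-01", header = TRUE)

unemployed_total <- user_df %>%
  # Convert the character columns to Date so day arithmetic is possible
  mutate(across(c(start_date, end_date), as.Date)) %>%
  group_by(person_id) %>%
  summarise(
    # Days from the earliest start to the latest end
    span_days = as.integer(difftime(max(end_date), min(start_date), units = "days")),
    # Days actually covered by jobs (assumes jobs do not overlap)
    employed_days = sum(as.integer(difftime(end_date, start_date, units = "days"))),
    # Whatever is left over is time between jobs
    unemployed_days = span_days - employed_days
  )
```

This gives one number per person but no rows for the gaps themselves, which is what the question is asking for.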

Recommended answer

library(dplyr)
user_df %>%
  arrange(start_date) %>%
  group_by(person_id) %>%
  mutate(nextstart = lead(start_date)) %>%
  filter(end_date < nextstart) %>%
  mutate(start_date = end_date, end_date = nextstart, unemployed = 1L) %>%
  select(-nextstart) %>%
  bind_rows(mutate(user_df, unemployed = 0L)) %>%
  arrange(person_id, start_date) %>%
  ungroup()
# # A tibble: 14 x 6
#    person_id job_number job_type start_date end_date   unemployed
#        <int>      <int> <chr>    <chr>      <chr>           <int>
#  1         1          1 B        2012-11-01 2014-01-01          0
#  2         1          1 B        2014-01-01 2016-02-01          1
#  3         1          2 A        2016-02-01 2016-10-01          0
#  4         1          2 A        2016-10-01 2016-12-01          1
#  5         1          3 A        2016-12-01 2020-01-01          0
#  6         1          4 B        2020-01-01 2021-01-01          0
#  7         2          1 A        2011-03-01 2012-08-01          0
#  8         2          1 A        2012-08-01 2013-01-01          1
#  9         2          2 B        2013-01-01 2020-01-01          0
# 10         2          3 A        2020-01-01 2021-01-01          0
# 11         2          4 B        2021-01-01 2021-01-17          0
# 12         3          1 A        2005-03-01 2011-03-01          0
# 13         3          1 A        2011-03-01 2012-01-01          1
# 14         3          2 B        2012-01-01 2014-01-01          0

Technically, this compares the dates as character strings, so the ordering is alphabetical. Because the dates are in zero-padded year-month-day format, alphabetical order matches chronological order and the result is the same, though it will be slightly less efficient (integer/numeric sorting is faster than alphabetic sorting).
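
If you would rather compare real dates than strings, one option is to convert the two columns with as.Date() first. The following is a minimal sketch on a subset of the data, assuming dplyr 1.0 or later for across(); the gap_days column is a hypothetical extra added only to show that date arithmetic becomes available.

```r
library(dplyr)

# Subset of the question's data; the date columns are read as character
user_df <- read.table(text = "person_id job_number job_type start_date end_date
  1 1 B 2012-11-01 2014-01-01
  1 2 A 2016-02-01 2016-10-01
  1 3 A 2016-12-01 2020-01-01", header = TRUE)

# Convert both date columns from character to Date
user_dates <- user_df %>%
  mutate(across(c(start_date, end_date), as.Date))

# Same gap detection as in the answer, now on Date values
gaps <- user_dates %>%
  arrange(start_date) %>%
  group_by(person_id) %>%
  mutate(nextstart = lead(start_date)) %>%
  filter(end_date < nextstart) %>%
  # Length of each gap in days (hypothetical extra column)
  mutate(gap_days = as.integer(difftime(nextstart, end_date, units = "days"))) %>%
  ungroup()
```

The rest of the pipeline should work unchanged, as long as the data passed to bind_rows has its date columns converted as well so the column types match.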

This works by first creating and then capturing just the unemployed periods of time,

user_df %>%
  arrange(start_date) %>%
  group_by(person_id) %>%
  mutate(nextstart = lead(start_date)) %>%
  filter(end_date < nextstart)
# # A tibble: 4 x 6
# # Groups:   person_id [3]
#   person_id job_number job_type start_date end_date   nextstart 
#       <int>      <int> <chr>    <chr>      <chr>      <chr>     
# 1         3          1 A        2005-03-01 2011-03-01 2012-01-01
# 2         2          1 A        2011-03-01 2012-08-01 2013-01-01
# 3         1          1 B        2012-11-01 2014-01-01 2016-02-01
# 4         1          2 A        2016-02-01 2016-10-01 2016-12-01

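then shifting the variables (each gap row takes the previous end_date as its start_date and nextstart as its end_date), adding unemployed, and finally binding the result back onto the original dataset. In this case, I added unemployed to the original data inside the bind_rows call; where to do this is mostly a matter of preference.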
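
As an illustration of that last point, here is a sketch of the other ordering: flag the original rows first, then derive the gap rows from the flagged data. The intermediate names employed, gaps and filled are made up for this example, and only person 1's rows are used.

```r
library(dplyr)

# Rows for person 1 from the question's data
user_df <- read.table(text = "person_id job_number job_type start_date end_date
  1 1 B 2012-11-01 2014-01-01
  1 2 A 2016-02-01 2016-10-01
  1 3 A 2016-12-01 2020-01-01
  1 4 B 2020-01-01 2021-01-01", header = TRUE)

# Flag every original row as employed up front
employed <- mutate(user_df, unemployed = 0L)

# Build the gap rows from the flagged data, overwriting the flag with 1L
gaps <- employed %>%
  arrange(start_date) %>%
  group_by(person_id) %>%
  mutate(nextstart = lead(start_date)) %>%
  filter(end_date < nextstart) %>%
  mutate(start_date = end_date, end_date = nextstart, unemployed = 1L) %>%
  select(-nextstart) %>%
  ungroup()

# Combine the job rows and the gap rows, then restore chronological order
filled <- bind_rows(employed, gaps) %>%
  arrange(person_id, start_date)
```

Both orderings should give the same rows; this one just keeps the unemployed column present at every step.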
