Filling missing dates in R


Question

I would like some help with a data frame transformation needed for an analysis. My data consists of a large number of individuals, each with their full employment history. "EX" is a code for the reason the employment ended. Something like this:

id  Date_start    Date_end       EX
13  "2001-02-01"  "2001-05-30"   A
13  "2002-03-01"  "2010-06-02"   B
14  ...           ...
...

So what I would like to do is "fill in the gaps". This may not be easy, and it's made harder because I want it done per id, and each new row should carry the EX value of the row before it, like this:

id  Date_start    Date_end       EX
13  "2001-02-01"  "2001-05-30"   A
13  "2001-05-31"  "2002-02-28"   A
13  "2002-03-01"  "2010-06-02"   B
14  ...           ...
...

I believe the trick would be some kind of lag and aggregate, but I'm totally lost.

Recommended answer

This is a little tricky. You can do most of the manipulation with the dplyr package and use the lubridate package to convert the date format (as.Date() works too, but lubridate makes it easier).

library(dplyr)
library(lubridate)

1. Create the sample data you provided.

# Column names and two sample rows (c() coerces every value to character)
col_names <- c("id", "Date_start", "Date_end", "EX")
row1 <- c(13, "2001-02-01", "2001-05-30", "A")
row2 <- c(13, "2002-03-01", "2010-06-02", "B")

testdata <- rbind(row1, row2) %>% data.frame(stringsAsFactors = FALSE)
row.names(testdata) <- NULL
names(testdata) <- col_names

# Convert the date columns from character to Date
testdata$Date_start <- testdata$Date_start %>% as_date()
testdata$Date_end <- testdata$Date_end %>% as_date()
testdata
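Because c() coerces mixed values to character, the id column above ends up as the string "13". A minimal alternative sketch, assuming you would rather keep id numeric, builds the data frame column by column. The name testdata_alt is only illustrative and is not part of the original answer.

```r
# Build the sample data directly, so each column keeps its own type.
testdata_alt <- data.frame(
  id = c(13, 13),
  Date_start = as.Date(c("2001-02-01", "2002-03-01")),
  Date_end = as.Date(c("2001-05-30", "2010-06-02")),
  EX = c("A", "B"),
  stringsAsFactors = FALSE
)

# Inspect the column types: id is numeric, the date columns are Date.
str(testdata_alt)
```

Either version should work with the steps that follow, since they only rely on the column names and on the date columns being of class Date.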

2. Create a new data set that holds the rows you want to add.

id: we keep the same id value, since the data is grouped by id.
Date_start: where there is a gap, this is the day after the previous row's Date_end; otherwise it is set to 0 as a placeholder, and those rows are filtered out.
Date_end: same logic; where there is a gap, this is the day before the current row's Date_start.
EX: the new row takes the EX value of the row before the gap, as you asked.

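new_data <- testdata %>%
  group_by(id) %>%
  # 0 marks "no gap"; ifelse() drops the Date class, so both new columns are numeric day counts
  mutate(Date_start1 = ifelse(Date_start - lag(Date_end) == 1, 0, lag(Date_end) + 1),
         Date_end1 = ifelse(Date_start - lag(Date_end) == 1, 0, Date_start - 1),
         EX = lag(EX)) %>%  # EX of the previous row (first(EX) would only be right for the first gap)
  filter(!Date_start1 == 0) %>%  # also drops each id's first row, where lag() gives NA
  select(id, Date_start = Date_start1, Date_end = Date_end1, EX) %>%
  distinct() %>%
  ungroup()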

3. Because ifelse() returned the gap dates as numeric values inside mutate(), use as_date() from lubridate to convert them back to date format.

new_data$Date_start <- as_date(new_data$Date_start)
new_data$Date_end <- as_date(new_data$Date_end)
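The reason the conversion is needed can be seen in isolation. The sketch below uses base R only (as.Date() with an explicit origin stands in for as_date()), and the variable names are illustrative: ifelse() builds its result from the test vector and drops the Date class, leaving the number of days since 1970-01-01.

```r
d <- as.Date("2001-05-30")

# ifelse() does not keep the Date class of its "yes"/"no" values.
x <- ifelse(TRUE, d + 1, NA)
class(x)  # "numeric"

# Converting the day count back gives the original date again.
restored <- as.Date(x, origin = "1970-01-01")
restored
```

If you would rather avoid the round trip, dplyr::if_else() is designed to preserve the type of its inputs, though it requires both branches to be of compatible types, so the 0 placeholder used above would have to become a missing Date instead.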

4. Combine it with your sample data and arrange it by Date_start.

final <- rbind(testdata,new_data) %>% data.frame() %>% arrange(Date_start)
final
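The four steps can also be condensed into one function. This is a sketch rather than the answer's own code: fill_gaps, the helper columns (prev_end, prev_EX, gap_start, gap_end) and the extra sample rows are all hypothetical. It assumes Date_start and Date_end are already of class Date, and it treats any start more than one day after the previous end as a gap.

```r
library(dplyr)

# Hypothetical helper: returns the input rows plus one extra row per gap
# between consecutive spells of the same id.
fill_gaps <- function(df) {
  gaps <- df %>%
    arrange(id, Date_start) %>%
    group_by(id) %>%
    mutate(prev_end = lag(Date_end), prev_EX = lag(EX)) %>%
    ungroup() %>%
    # keep rows that start more than one day after the previous spell ended
    filter(!is.na(prev_end), as.numeric(Date_start - prev_end) > 1) %>%
    mutate(gap_start = prev_end + 1, gap_end = Date_start - 1) %>%
    select(id, Date_start = gap_start, Date_end = gap_end, EX = prev_EX)

  bind_rows(df, gaps) %>% arrange(id, Date_start)
}

# Illustrative data: the question's two rows plus two made-up ones.
spells <- data.frame(
  id = c(13, 13, 13, 14),
  Date_start = as.Date(c("2001-02-01", "2002-03-01", "2010-06-03", "2005-01-01")),
  Date_end = as.Date(c("2001-05-30", "2010-06-02", "2012-01-31", "2005-12-31")),
  EX = c("A", "B", "C", "A"),
  stringsAsFactors = FALSE
)

filled <- fill_gaps(spells)
filled
```

Because the gap dates are computed with plain Date arithmetic instead of ifelse(), they stay of class Date and no conversion step is needed. Overlapping spells are simply left alone here; whether that is the right behaviour depends on your data.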

The final result contains the original rows plus one filled-in row for each gap, ordered by Date_start.
