Convert data of overlapping time ranges to data of distinct time ranges

Problem description

I have data that consists of observations of the range of time during which a particular status applied to an individual. An individual can have multiple statuses at a given time, or no status at all (in which case there is no observation).

ID  STATUS START      END    
1   A      2013-03-07 2013-04-20
1   B      2013-04-10 2013-05-10
1   C      2013-04-16 2013-07-11
1   B      2013-07-25 2013-08-08
2   A      2013-09-10 2014-04-21
2   C      2013-12-27 2014-01-26
2   D      2014-04-28 2014-05-10
2   E      2014-05-11 2014-07-16

I would like to convert this into a dataframe consisting of distinct, non-overlapping periods and all of the statuses that applied during those periods. The result from the example would be as below.

ID  STATUS START      END    
1   A      2013-03-07 2013-04-10
1   A,B    2013-04-10 2013-04-16
1   A,B,C  2013-04-16 2013-04-20
1   B,C    2013-04-20 2013-05-10
1   C      2013-05-10 2013-07-11
1   B      2013-07-25 2013-08-08
2   A      2013-09-10 2013-12-27
2   A,C    2013-12-27 2014-01-26
2   D      2014-04-28 2014-05-10
2   E      2014-05-11 2014-07-16

My data is fairly large, so I am struggling to find an efficient way to do this. I've found semi-related problems, such as here, but they typically involve collapsing overlapping regions into a super region, rather than breaking them out into distinct subregions.

Any help or ideas would be appreciated. Thanks.

Solution

library(dplyr)

# Collect every start and end date as a sorted vector of breakpoints;
# each pair of consecutive breakpoints is one candidate date range
date_breaks <- sort(c(df$START, df$END))

# Build a data.frame with all possible combinations
df2 <- expand.grid(START = head(date_breaks, -1),
                   ID = unique(df$ID),
                   STATUS = unique(df$STATUS),
                   stringsAsFactors = FALSE)
df2$END <- tail(date_breaks, -1)

# Join in original data and filter to where the start and end are in range
df2 %>%
    inner_join(df, by = c("ID", "STATUS")) %>%
    filter(START.y <= START.x, END.y >= END.x) %>%
    group_by(ID, START = START.x, END = END.x) %>%
    summarise(STATUS = paste(unique(STATUS), collapse = ', ')) %>% 
    select(ID, STATUS, START, END)

#  A tibble: 11 x 4
#  Groups:   ID, START [11]
#       ID STATUS  START      END       
#    <int> <chr>   <date>     <date>    
#  1     1 A       2013-03-07 2013-04-10
#  2     1 A, B    2013-04-10 2013-04-16
#  3     1 A, B, C 2013-04-16 2013-04-20
#  4     1 B, C    2013-04-20 2013-05-10
#  5     1 C       2013-05-10 2013-07-11
#  6     1 B       2013-07-25 2013-08-08
#  7     2 A       2013-09-10 2013-12-27
#  8     2 A, C    2013-12-27 2014-01-26
#  9     2 A       2014-01-26 2014-04-21
# 10     2 D       2014-04-28 2014-05-10
# 11     2 E       2014-05-11 2014-07-16

Data

df <- read.table(text = "ID  STATUS START      END    
                         1   A      2013-03-07 2013-04-20
                         1   B      2013-04-10 2013-05-10
                         1   C      2013-04-16 2013-07-11
                         1   B      2013-07-25 2013-08-08
                         2   A      2013-09-10 2014-04-21
                         2   C      2013-12-27 2014-01-26
                         2   D      2014-04-28 2014-05-10
                         2   E      2014-05-11 2014-07-16",
                 header = TRUE,
                 colClasses = c('integer', 'character', 'Date', 'Date'))
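
Since the question mentions that the data is fairly large, here is a rough base R sketch of the same idea that computes the breakpoints within each ID rather than crossing every date with every ID and every STATUS. This is not part of the answer above: the helper name split_ranges is made up for illustration, it joins statuses with "," rather than ", ", and whether it is actually faster on real data is an assumption that would need to be measured.

```r
# Same sample data as above, repeated so the sketch runs on its own
df <- read.table(text = "ID STATUS START END
1 A 2013-03-07 2013-04-20
1 B 2013-04-10 2013-05-10
1 C 2013-04-16 2013-07-11
1 B 2013-07-25 2013-08-08
2 A 2013-09-10 2014-04-21
2 C 2013-12-27 2014-01-26
2 D 2014-04-28 2014-05-10
2 E 2014-05-11 2014-07-16",
                 header = TRUE,
                 colClasses = c("integer", "character", "Date", "Date"))

# Hypothetical helper: split the ranges of ONE individual into distinct segments
split_ranges <- function(d) {
  # Sorted, de-duplicated breakpoints for this ID only
  breaks <- sort(unique(c(d$START, d$END)))
  seg_start <- head(breaks, -1)
  seg_end <- tail(breaks, -1)

  # For each segment, collect the statuses whose range fully covers it
  status <- vapply(seq_along(seg_start), function(i) {
    covered <- d$START <= seg_start[i] & d$END >= seg_end[i]
    paste(sort(unique(d$STATUS[covered])), collapse = ",")
  }, character(1))

  out <- data.frame(ID = rep(d$ID[1], length(seg_start)),
                    STATUS = status,
                    START = seg_start,
                    END = seg_end,
                    stringsAsFactors = FALSE)

  # Drop the gaps, i.e. segments during which no status applied
  out[out$STATUS != "", ]
}

# Apply the helper per ID and stack the pieces
result <- do.call(rbind, lapply(split(df, df$ID), split_ranges))
rownames(result) <- NULL
print(result)
```

On the sample data this should give the same 11 periods as the dplyr pipeline, including the 2014-01-26 to 2014-04-21 period for ID 2 where only status A applies.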
