Matching values between data frames based on overlapping dates

Question

I am currently dealing with the following data structures:

Attributes df:

  ID Begin_A      End_A        Interval                          Value
1  5 1990-03-01   2017-03-10   1990-03-01 UTC--2017-03-10 UTC    Cat1
2 10 1993-12-01   2017-12-02   1993-12-01 UTC--2017-12-02 UTC    Cat2
3  5 1991-03-01   2017-03-03   1991-03-01 UTC--2017-03-03 UTC    Cat3
4 10 1995-12-05   2017-12-10   1995-12-05 UTC--2017-12-10 UTC    Cat4

Bookings df:

  ID    Begin_A      End_A                       Interval
1  5 2017-03-03 2017-03-05 2017-03-03 UTC--2017-03-05 UTC
2  6 2017-05-03 2017-05-05 2017-05-03 UTC--2017-05-05 UTC
3  8 2017-03-03 2017-03-05 2017-03-03 UTC--2017-03-05 UTC
4 10 2017-12-05 2017-12-06 2017-12-05 UTC--2017-12-06 UTC

As described in my earlier post, "Matching values conditioned on overlapping Intervals and ID", I want to restructure the data as follows: take the ID from bookings and filter all rows of the attributes data frame whose ID matches the booking ID. Then check which of those rows also have overlapping time intervals (int_overlaps from lubridate). Finally, take the respective values from the Value column and print them in the Attribute_value column.

The expected result is as follows:

ID  Begin_A      End_A        Interval                        Attribute_value
5   2017-03-03   2017-03-05   2017-03-03 UTC--2017-03-05 UTC  Cat1,Cat3
6   2017-05-03   2017-05-05   2017-05-03 UTC--2017-05-05 UTC  NA
8   2017-03-03   2017-03-05   2017-03-03 UTC--2017-03-05 UTC  NA
10  2017-12-05   2017-12-06   2017-12-05 UTC--2017-12-06 UTC  Cat4
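
As a reference for what the result should contain, the operation described above can be sketched in base R. This is only an illustration under a few assumptions: the names attrs and books are hypothetical stand-ins for the two data frames, and the overlap test is written as two plain date comparisons (both ends inclusive) in place of lubridate's int_overlaps, which should give the same answer for these ranges. It compares every booking with every attribute row, so it is meant to show the logic, not to be fast.

```r
# Hypothetical copies of the two example data frames (dates only, no Interval column).
attrs <- data.frame(
  ID      = c(5, 10, 5, 10),
  Begin_A = as.Date(c("1990-03-01", "1993-12-01", "1991-03-01", "1995-12-05")),
  End_A   = as.Date(c("2017-03-10", "2017-12-02", "2017-03-03", "2017-12-10")),
  Value   = c("Cat1", "Cat2", "Cat3", "Cat4"),
  stringsAsFactors = FALSE
)
books <- data.frame(
  ID      = c(5, 6, 8, 10),
  Begin_A = as.Date(c("2017-03-03", "2017-05-03", "2017-03-03", "2017-12-05")),
  End_A   = as.Date(c("2017-03-05", "2017-05-05", "2017-03-05", "2017-12-06"))
)

# For each booking row: find attribute rows with the same ID whose date range
# overlaps the booking's range, then paste their values together.
books$Attribute_value <- vapply(seq_len(nrow(books)), function(i) {
  hit <- attrs$ID == books$ID[i] &
    attrs$Begin_A <= books$End_A[i] &   # attribute starts before the booking ends
    attrs$End_A >= books$Begin_A[i]     # attribute ends after the booking starts
  if (any(hit)) paste(sort(attrs$Value[hit]), collapse = ",") else NA_character_
}, character(1))

books
```

Two ranges overlap exactly when each one starts no later than the other ends, which is why no day-by-day expansion is needed for the comparison itself.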

ycw already provided a partial answer to this question here: https://stackoverflow.com/a/46819541/8259308. That solution cannot cope with long periods between Begin_A and End_A in the attributes data frame, because the following command creates a vector with one entry per individual date:

complete(Date = full_seq(Date, period = 1), ID) %>%

Since my original dataset has a very large number of observations with long time frames in the attributes data frame, R cannot process that many observations. My idea was either to modify the line above so that the dates advance in months rather than days (which would also reduce precision) or to try a new approach. The following code produces the data frames presented above:

library(lubridate)
library(tidyverse)
# Attributes data frame:
date1 <- as.Date(c('1990-3-1','1993-12-1','1991-3-1','1995-12-5'))
date2 <- as.Date(c('2017-3-10','2017-12-2','2017-3-3','2017-12-10'))
attributes <- data.frame(matrix(NA,nrow=4, ncol = 5)) 
names(attributes) <- c("ID","Begin_A", "End_A", "Interval", "Value")
attributes$ID <- as.numeric(c(5,10,5,10))
attributes$Begin_A <-date1
attributes$End_A <-date2
attributes$Interval <-attributes$Begin_A %--% attributes$End_A
attributes$Value<- as.character(c("Cat1","Cat2","Cat3","Cat4"))

### Bookings data frame:

date1 <- as.Date(c('2017-3-3','2017-5-3','2017-3-3','2017-12-5'))
date2 <- as.Date(c('2017-3-5','2017-5-5','2017-3-5','2017-12-6'))
bookings <- data.frame(matrix(NA,nrow=4, ncol = 4)) 
names(bookings) <- c("ID","Begin_A", "End_A", "Interval")
bookings$ID <- as.numeric(c(5,6,8,10))
bookings$Begin_A <-date1
bookings$End_A <-date2
bookings$Interval <-bookings$Begin_A %--% bookings$End_A

This is the solution for the previous post provided by ycw:

library(tidyverse)

attributes2 <- attributes %>%
  select(-Interval) %>%
  gather(Type, Date, ends_with("_A")) %>%
  select(-Type) %>%
  group_by(Value) %>%
  complete(Date = full_seq(Date, period = 1), ID) %>%
  ungroup()

bookings2 <- bookings %>%
  select(-Interval) %>%
  gather(Type, Date, ends_with("_A")) %>%
  select(-Type) %>%
  group_by(ID) %>%
  complete(Date = full_seq(Date, period = 1)) %>%
  ungroup()

bookings3 <- bookings2 %>%
  left_join(attributes2, by = c("ID", "Date")) %>%
  group_by(ID) %>%
  summarise(Attribute_value = toString(sort(unique(Value)))) %>%
  mutate(Attribute_value = ifelse(Attribute_value %in% "", NA, Attribute_value))

bookings4 <- bookings %>% left_join(bookings3, by = "ID")
bookings4
  ID    Begin_A      End_A                       Interval Attribute_value
1  5 2017-03-03 2017-03-05 2017-03-03 UTC--2017-03-05 UTC      Cat1, Cat3
2  6 2017-05-03 2017-05-05 2017-05-03 UTC--2017-05-05 UTC            <NA>
3  8 2017-03-03 2017-03-05 2017-03-03 UTC--2017-03-05 UTC            <NA>
4 10 2017-12-05 2017-12-06 2017-12-05 UTC--2017-12-06 UTC            Cat4
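
To see why the full_seq(Date, period = 1) step in the solution above becomes a problem, it helps to count how many daily rows it would have to create. The sketch below does that for the four example attribute rows only; the variable names are illustrative, and the count assumes one row per calendar day with both end dates included.

```r
# Begin and end dates of the four example attribute rows.
begin <- as.Date(c("1990-03-01", "1993-12-01", "1991-03-01", "1995-12-05"))
end   <- as.Date(c("2017-03-10", "2017-12-02", "2017-03-03", "2017-12-10"))

# Number of daily rows a day-by-day expansion needs for each attribute row.
days_per_row <- as.integer(end - begin) + 1L
days_per_row

# Total number of rows after expanding all four attribute rows.
sum(days_per_row)
```

Each example row spans more than twenty years, so even four attribute rows expand into thousands of daily rows apiece, and the total grows in proportion to both the number of attribute rows and the length of their ranges.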

Recommended answer

You may consider data.table which allows for "non-equi joins", i.e. joins based on >=, >, <= and <. In the same call, aggregate operations may be performed on the groups in the LHS data set that each row in the RHS data set (i) matches (by = .EACHI).

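# d1 (attributes) and d2 (bookings) are built under "Data preparation" below.
# The join matches on id and keeps attribute rows whose end is on or after the booking's begin.
d1[d2, on = .(id = id, end >= begin),
         .(i.begin, i.end, val_str = toString(val)), by = .EACHI]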

#    id        end    i.begin      i.end    val_str
# 1:  5 2017-03-03 2017-03-03 2017-03-05 Cat3, Cat1
# 2:  6 2017-05-03 2017-05-03 2017-05-05         NA
# 3:  8 2017-03-03 2017-03-03 2017-03-05         NA
# 4: 10 2017-12-05 2017-12-05 2017-12-06       Cat4


Data preparation:

d1 <- data.frame(id = c(5, 10, 5, 10),
                 begin = as.Date(c('1990-3-1','1993-12-1','1991-3-1','1995-12-5')),
                 end = as.Date(c('2017-3-10','2017-12-2','2017-3-3','2017-12-10')),
                 val = c("Cat1", "Cat2", "Cat3", "Cat4"))

d2 <- data.frame(id = c(5, 6, 8, 10),
                 begin = as.Date(c('2017-3-3','2017-5-3','2017-3-3','2017-12-5')),
                 end = as.Date(c('2017-3-5','2017-5-5','2017-3-5','2017-12-6')))

library(data.table)
setDT(d1)
setDT(d2)
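
The call in the answer tests only one side of the overlap (the attribute's end against the booking's begin). That is enough for the sample data, where every attribute range starts long before any booking, but it may not hold in general. It also returns the text "NA" for unmatched bookings, because toString(NA) is a string. The following is a possible variant, not part of the original answer, that checks both sides of the overlap and returns a real missing value. The names attrs, books, b_begin and b_end are illustrative assumptions.

```r
library(data.table)

# Same sample data as in the answer; attrs/books are illustrative names.
attrs <- data.table(
  id    = c(5, 10, 5, 10),
  begin = as.Date(c("1990-03-01", "1993-12-01", "1991-03-01", "1995-12-05")),
  end   = as.Date(c("2017-03-10", "2017-12-02", "2017-03-03", "2017-12-10")),
  val   = c("Cat1", "Cat2", "Cat3", "Cat4")
)
books <- data.table(
  id    = c(5, 6, 8, 10),
  begin = as.Date(c("2017-03-03", "2017-05-03", "2017-03-03", "2017-12-05")),
  end   = as.Date(c("2017-03-05", "2017-05-05", "2017-03-05", "2017-12-06"))
)

# In each `on` condition the left name is a column of attrs, the right name a
# column of books: attrs$begin <= books$end and attrs$end >= books$begin.
res <- attrs[books,
             on = .(id, begin <= end, end >= begin),
             .(b_begin = i.begin,
               b_end   = i.end,
               # unmatched bookings see val as NA; return a real NA for them
               val_str = if (all(is.na(val))) NA_character_ else toString(sort(val))),
             by = .EACHI]

# Keep the booking's own dates under their usual names.
res <- res[, .(id, begin = b_begin, end = b_end, val_str)]
res
```

The final selection step is there because the join columns in the raw result carry values from the bookings table under the attribute column names, which is easy to misread.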
