Matching values between data frames based on overlapping dates
Question
I am currently working with the following data structures:
Attributes df:
  ID    Begin_A      End_A                       Interval Value
1  5 1990-03-01 2017-03-10 1990-03-01 UTC--2017-03-10 UTC  Cat1
2 10 1993-12-01 2017-12-02 1993-12-01 UTC--2017-12-02 UTC  Cat2
3  5 1991-03-01 2017-03-03 1991-03-01 UTC--2017-03-03 UTC  Cat3
4 10 1995-12-05 2017-12-10 1995-12-05 UTC--2017-12-10 UTC  Cat4
Bookings df:
  ID    Begin_A      End_A                       Interval
1  5 2017-03-03 2017-03-05 2017-03-03 UTC--2017-03-05 UTC
2  6 2017-05-03 2017-05-05 2017-05-03 UTC--2017-05-05 UTC
3  8 2017-03-03 2017-03-05 2017-03-03 UTC--2017-03-05 UTC
4 10 2017-12-05 2017-12-06 2017-12-05 UTC--2017-12-06 UTC
As described in the earlier post "Matching values conditioned on overlapping Intervals and ID", I want to restructure the data as follows: take the ID from bookings and filter all rows of the attributes data frame whose ID matches the booking ID. Among those rows, check which also have a time interval that overlaps the booking's interval (int_overlaps from lubridate). Then take the corresponding values from the Value column and collect them in a new Attribute_value column.
The expected result looks like this:
ID    Begin_A      End_A                       Interval Attribute_value
 5 2017-03-03 2017-03-05 2017-03-03 UTC--2017-03-05 UTC       Cat1,Cat3
 6 2017-05-03 2017-05-05 2017-05-03 UTC--2017-05-05 UTC              NA
 8 2017-03-03 2017-03-05 2017-03-03 UTC--2017-03-05 UTC              NA
10 2017-12-05 2017-12-06 2017-12-05 UTC--2017-12-06 UTC            Cat4
ycw already provided a partial answer to this question here: https://stackoverflow.com/a/46819541/8259308. That solution does not cope with long periods between Begin_A and End_A in the attributes data frame, because it builds a vector containing every individual date with this line:
complete(Date = full_seq(Date, period = 1), ID) %>%
Since my original dataset has a very large number of observations with long time frames in the attributes data frame, R cannot process the expanded data. My idea was either to modify the line above so that the dates step by month instead of by day (which would also reduce precision) or to try a new approach. The following code produces the data frames shown above:
library(lubridate)
library(tidyverse)
# Attributes data frame:
date1 <- as.Date(c('1990-3-1','1993-12-1','1991-3-1','1995-12-5'))
date2 <- as.Date(c('2017-3-10','2017-12-2','2017-3-3','2017-12-10'))
attributes <- data.frame(matrix(NA,nrow=4, ncol = 5))
names(attributes) <- c("ID","Begin_A", "End_A", "Interval", "Value")
attributes$ID <- as.numeric(c(5,10,5,10))
attributes$Begin_A <-date1
attributes$End_A <-date2
attributes$Interval <-attributes$Begin_A %--% attributes$End_A
attributes$Value<- as.character(c("Cat1","Cat2","Cat3","Cat4"))
### Bookings data frame:
date1 <- as.Date(c('2017-3-3','2017-5-3','2017-3-3','2017-12-5'))
date2 <- as.Date(c('2017-3-5','2017-5-5','2017-3-5','2017-12-6'))
bookings <- data.frame(matrix(NA,nrow=4, ncol = 4))
names(bookings) <- c("ID","Begin_A", "End_A", "Interval")
bookings$ID <- as.numeric(c(5,6,8,10))
bookings$Begin_A <-date1
bookings$End_A <-date2
bookings$Interval <-bookings$Begin_A %--% bookings$End_A
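To get a feel for why the day-by-day expansion becomes expensive, the following rough sketch counts how many daily rows each attribute interval would expand to. It reuses the sample dates from above; the names begin_a, end_a and n_days are only illustrative and do not appear in the original code.

```r
# Sample attribute dates from the question (variable names are illustrative).
begin_a <- as.Date(c("1990-03-01", "1993-12-01", "1991-03-01", "1995-12-05"))
end_a   <- as.Date(c("2017-03-10", "2017-12-02", "2017-03-03", "2017-12-10"))

# Number of daily rows that a full_seq(..., period = 1) expansion needs
# per interval, counting both end points.
n_days <- as.integer(end_a - begin_a) + 1L

n_days       # rows per attribute record
sum(n_days)  # total rows after expansion
```

Each of the four intervals spans several thousand days, so the size of the expanded table is driven by the interval lengths rather than by the number of attribute records.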
This is ycw's solution from the previous post:
library(tidyverse)
# Expand every attribute interval into one row per day
attributes2 <- attributes %>%
  select(-Interval) %>%
  gather(Type, Date, ends_with("_A")) %>%
  select(-Type) %>%
  group_by(Value) %>%
  complete(Date = full_seq(Date, period = 1), ID) %>%
  ungroup()

# Expand every booking interval into one row per day
bookings2 <- bookings %>%
  select(-Interval) %>%
  gather(Type, Date, ends_with("_A")) %>%
  select(-Type) %>%
  group_by(ID) %>%
  complete(Date = full_seq(Date, period = 1)) %>%
  ungroup()

# Join on ID and day, then collapse the matching values per booking ID
bookings3 <- bookings2 %>%
  left_join(attributes2, by = c("ID", "Date")) %>%
  group_by(ID) %>%
  summarise(Attribute_value = toString(sort(unique(Value)))) %>%
  mutate(Attribute_value = ifelse(Attribute_value %in% "", NA, Attribute_value))

bookings4 <- bookings %>% left_join(bookings3, by = "ID")
bookings4
  ID    Begin_A      End_A                       Interval Attribute_value
1  5 2017-03-03 2017-03-05 2017-03-03 UTC--2017-03-05 UTC      Cat1, Cat3
2  6 2017-05-03 2017-05-05 2017-05-03 UTC--2017-05-05 UTC            <NA>
3  8 2017-03-03 2017-03-05 2017-03-03 UTC--2017-03-05 UTC            <NA>
4 10 2017-12-05 2017-12-06 2017-12-05 UTC--2017-12-06 UTC            Cat4
Recommended answer
You may consider data.table, which allows for "non-equi joins", i.e. joins based on >=, >, <= and <. In the same call, aggregate operations may be performed on the groups in the LHS data set that each row in the RHS data set (i) matches (by = .EACHI).
d1[d2, on = .(id = id, end >= begin),
.(i.begin, i.end, val_str = toString(val)), by = .EACHI]
#    id        end    i.begin      i.end    val_str
# 1:  5 2017-03-03 2017-03-03 2017-03-05 Cat3, Cat1
# 2:  6 2017-05-03 2017-05-03 2017-05-05         NA
# 3:  8 2017-03-03 2017-03-03 2017-03-05         NA
# 4: 10 2017-12-05 2017-12-05 2017-12-06       Cat4
Data preparation:
d1 <- data.frame(id = c(5, 10, 5, 10),
begin = as.Date(c('1990-3-1','1993-12-1','1991-3-1','1995-12-5')),
end = as.Date(c('2017-3-10','2017-12-2','2017-3-3','2017-12-10')),
val = c("Cat1", "Cat2", "Cat3", "Cat4"))
d2 <- data.frame(id = c(5, 6, 8, 10),
begin = as.Date(c('2017-3-3','2017-5-3','2017-3-3','2017-12-5')),
end = as.Date(c('2017-3-5','2017-5-5','2017-3-5','2017-12-6')))
library(data.table)
setDT(d1)
setDT(d2)
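The join above compares only the attribute end date with the booking start date, which is enough for the sample data because every attribute interval starts long before the bookings. As a hedged sketch rather than part of the original answer, a variant that checks both boundaries of the overlap could look like the following. The output column names Begin_A, End_A and Attribute_value are chosen here to mirror the question, and the explicit NA_character_ branch is an added assumption so that unmatched bookings get a real missing value instead of the string "NA".

```r
library(data.table)

# Same toy data as in the data preparation step.
d1 <- data.table(
  id    = c(5, 10, 5, 10),
  begin = as.Date(c("1990-03-01", "1993-12-01", "1991-03-01", "1995-12-05")),
  end   = as.Date(c("2017-03-10", "2017-12-02", "2017-03-03", "2017-12-10")),
  val   = c("Cat1", "Cat2", "Cat3", "Cat4")
)
d2 <- data.table(
  id    = c(5, 6, 8, 10),
  begin = as.Date(c("2017-03-03", "2017-05-03", "2017-03-03", "2017-12-05")),
  end   = as.Date(c("2017-03-05", "2017-05-05", "2017-03-05", "2017-12-06"))
)

# Two intervals overlap when each one starts no later than the other ends:
#   d1$begin <= d2$end  and  d1$end >= d2$begin
# In `on`, the left-hand name refers to d1 (x) and the right-hand name to d2 (i).
res <- d1[d2,
  on = .(id, begin <= end, end >= begin),
  .(Begin_A = i.begin,
    End_A   = i.end,
    # Unmatched bookings see a single NA in `val`; keep it as a real NA.
    Attribute_value = if (all(is.na(val))) NA_character_ else toString(sort(val))),
  by = .EACHI]

# Drop the join columns that by = .EACHI adds and keep the booking view.
res <- res[, .(id, Begin_A, End_A, Attribute_value)]
res
```

Because the comparison is done on the interval boundaries, no per-day expansion is needed, so the cost does not depend on how long the attribute intervals are.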