基于相似但不准确的日期进行合并 [英] Merge based on similar but not exact dates
问题描述
我有两个数据框.我想仅在 id 相同且 VisitDate 相同的情况下合并它们,最多不超过三天.合并很简单,但如何指定日期范围而不是确切的日期合并?
I have two dataframes. I would like to merge them only where the id is the same and the VisitDate is the same by no more that three days. Merging is simple but how can I specify the date range rather than exact date merge?
这是一个示例:
df1:
structure(list(V1 = structure(c(5L, 1L, 1L, 2L, 3L, 3L, 3L, 4L
), .Label = c("1", "2", "3", "4", "Id"), class = "factor"), V2 = structure(c(7L,
5L, 6L, 5L, 3L, 4L, 2L, 1L), .Label = c("2012-01-02", "2012-02-03",
"2012-02-14", "2012-03-06", "2012-05-23", "2014-07-13", "VisitDate"
), class = "factor"), V3 = structure(c(8L, 2L, 4L, 5L, 1L, 6L,
7L, 3L), .Label = c("12", "2", "22", "23", "33", "43", "54",
"Another column"), class = "factor")), .Names = c("V1", "V2",
"V3"), class = "data.frame", row.names = c(NA, -8L))
df2:
structure(list(V1 = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("1",
"2", "3", "4", "Id"), class = "factor"), V2 = structure(c(5L,
4L, 3L, 2L, 1L), .Label = c("2012-01-08", "2012-02-16", "2012-05-25",
"2012-07-15", "VisitDate"), class = "factor"), V3 = structure(c(5L,
3L, 1L, 4L, 2L), .Label = c("22", "33", "43", "64", "Another column"
), class = "factor")), .Names = c("V1", "V2", "V3"), class = "data.frame", row.names = c(NA,
-5L))
输出应该是:
Id VisitDate.df1 col.2f1 VisitDate.df2 col.df2
1 2014-07-13 23 2012-07-15 43
2 2012-05-23 33 2012-05-25 22
3 2012-02-14 12 2012-02-16 64
推荐答案
如果你的数据不是太大,你可以简单的加入id
然后filter到关闭不超过 3 天的行.
If your data is not too large, you can simple join on id
and then filter down to the rows that are off by no more than 3 days.
例如在tidyverse
框架下:
library(tidyverse)
df1 = structure(list(Id = c(1, 1, 2, 3, 3, 3, 4),
VisitDate = structure(c(15483, 16264, 15483, 15384, 15405, 15373, 15341), class = "Date"),
Column = c(2, 4, 5, 1, 6, 7, 3)),
.Names = c("Id", "VisitDate", "Column"),
row.names = 1:7,
class = "data.frame")
df2 = structure(list(Id = c(1, 2, 3, 4),
VisitDate = structure(c(15536, 15485, 15386, 15347), class = "Date"),
Column = c(3, 1, 4, 2)),
.Names = c("Id", "VisitDate", "Column"),
row.names = 1:4,
class = "data.frame")
df1 %>%
left_join(df2, by = "Id", suffix = c(".df1", ".df2")) %>%
filter(abs(VisitDate.df1 - VisitDate.df2) <= 3)
#> Id VisitDate.df1 Column.df1 VisitDate.df2 Column.df2
#> 1 2 2012-05-23 5 2012-05-25 1
#> 2 3 2012-02-14 1 2012-02-16 4
另一种方法是复制行数据,可能在df1
中.如果您有多个日期出现相同的 ID,这可能会更有效.
An alternative approach is to replicate rows of your data, perhaps in df1
. If you have the same ID occurring across multiple dates, this might be more efficient.
df1 %>%
mutate(date = map(VisitDate, function(x){seq(x - 3, x + 3, by = 1)})) %>%
unnest(date) %>%
inner_join(df2, by = c("Id", "date" = "VisitDate"), suffix = c(".df1", ".df2"))
#> Id VisitDate Column.df1 date Column.df2
#> 1 2 2012-05-23 5 2012-05-25 1
#> 2 3 2012-02-14 1 2012-02-16 4
这篇关于基于相似但不准确的日期进行合并的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!