基于相似但不准确的日期进行合并 [英] Merge based on similar but not exact dates

查看:29
本文介绍了基于相似但不准确的日期进行合并的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个数据框.我想仅在 id 相同且 VisitDate 相同的情况下合并它们,最多不超过三天.合并很简单,但如何指定日期范围而不是确切的日期合并?

I have two dataframes. I would like to merge them only where the id is the same and the VisitDate is the same by no more that three days. Merging is simple but how can I specify the date range rather than exact date merge?

这是一个示例:

df1:

structure(list(V1 = structure(c(5L, 1L, 1L, 2L, 3L, 3L, 3L, 4L
), .Label = c("1", "2", "3", "4", "Id"), class = "factor"), V2 = structure(c(7L, 
5L, 6L, 5L, 3L, 4L, 2L, 1L), .Label = c("2012-01-02", "2012-02-03", 
"2012-02-14", "2012-03-06", "2012-05-23", "2014-07-13", "VisitDate"
), class = "factor"), V3 = structure(c(8L, 2L, 4L, 5L, 1L, 6L, 
7L, 3L), .Label = c("12", "2", "22", "23", "33", "43", "54", 
"Another column"), class = "factor")), .Names = c("V1", "V2", 
"V3"), class = "data.frame", row.names = c(NA, -8L))

df2:

    structure(list(V1 = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("1", 
"2", "3", "4", "Id"), class = "factor"), V2 = structure(c(5L, 
4L, 3L, 2L, 1L), .Label = c("2012-01-08", "2012-02-16", "2012-05-25", 
"2012-07-15", "VisitDate"), class = "factor"), V3 = structure(c(5L, 
3L, 1L, 4L, 2L), .Label = c("22", "33", "43", "64", "Another column"
), class = "factor")), .Names = c("V1", "V2", "V3"), class = "data.frame", row.names = c(NA, 
-5L))

输出应该是:

Id      VisitDate.df1        col.2f1    VisitDate.df2    col.df2 
1       2014-07-13              23      2012-07-15       43
2       2012-05-23              33      2012-05-25       22
3       2012-02-14              12      2012-02-16       64

推荐答案

如果你的数据不是太大,你可以简单的加入id然后filter到关闭不超过 3 天的行.

If your data is not too large, you can simple join on id and then filter down to the rows that are off by no more than 3 days.

例如在tidyverse框架下:

library(tidyverse)

df1 = structure(list(Id = c(1, 1, 2, 3, 3, 3, 4), 
                     VisitDate = structure(c(15483, 16264, 15483, 15384, 15405, 15373, 15341), class = "Date"),
                     Column = c(2, 4, 5, 1, 6, 7, 3)), 
                .Names = c("Id", "VisitDate", "Column"), 
                row.names = 1:7, 
                class = "data.frame")

df2 = structure(list(Id = c(1, 2, 3, 4), 
                     VisitDate = structure(c(15536, 15485, 15386, 15347), class = "Date"), 
                     Column = c(3, 1, 4, 2)), 
                .Names = c("Id", "VisitDate", "Column"), 
                row.names = 1:4, 
                class = "data.frame")


df1 %>%
    left_join(df2, by = "Id", suffix = c(".df1", ".df2")) %>%
    filter(abs(VisitDate.df1 - VisitDate.df2) <= 3)
#>   Id VisitDate.df1 Column.df1 VisitDate.df2 Column.df2
#> 1  2    2012-05-23          5    2012-05-25          1
#> 2  3    2012-02-14          1    2012-02-16          4

另一种方法是复制行数据,可能在df1中.如果您有多个日期出现相同的 ID,这可能会更有效.

An alternative approach is to replicate rows of your data, perhaps in df1. If you have the same ID occurring across multiple dates, this might be more efficient.

df1 %>%
    mutate(date = map(VisitDate, function(x){seq(x - 3, x + 3, by = 1)})) %>%
    unnest(date) %>%
    inner_join(df2, by = c("Id", "date" = "VisitDate"), suffix = c(".df1", ".df2"))
#>   Id  VisitDate Column.df1       date Column.df2
#> 1  2 2012-05-23          5 2012-05-25          1
#> 2  3 2012-02-14          1 2012-02-16          4

这篇关于基于相似但不准确的日期进行合并的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆