How to merge two datasets in R based on a similar but not exact time variable stored as a string?

Question

I have two datasets I would like to merge that look like this:

df1
ID|date    |       time|
 1 04/06/21   "05:02:06"
 2 05/03/21   "04:12:11"
 3 02/02/20   "03:02:10"
 4 09/09/20   "09:12:14"
 5 02/02/21   "15:18:20" 
 6 04/04/21   "14:00:00"  

df2
2ID|date    |       time|
 1 04/06/21   "05:12:06"
 2 05/03/21   "04:08:11"
 3 02/02/20   "03:09:10"
 4 09/09/20   "09:12:14"
 5 02/02/21   "15:18:20" 
 6 04/04/21   "15:00:00"

Normally, if I run a script based on an exact match:

df3 <- df2 %>% left_join(df1, by=c("incident_date","incident_time"))

I would get:

ID|    date|time    |2ID
1  04/06/21 "05:02:06" 
2  05/03/21 "04:12:11"
3  02/02/20 "03:02:10"
4  09/09/20 "09:12:14" 4
5  02/02/21 "15:18:20" 5
6  12/14/22 "14:00:00"

Note that only two rows match exactly. However, I would also like to match the three remaining rows whose times are close. I would like the script to allow a tolerance of 45 minutes either way, so the result would look like this:

ID|    date|time    |2ID
1  04/06/21 "05:02:06" 1
2  05/03/21 "04:12:11" 2
3  02/02/20 "03:02:10" 3
4  09/09/20 "09:12:14" 4
5  02/02/21 "15:18:20" 5
6  12/14/22 "14:00:00"

I attempted something like the following, based on an earlier Stack Overflow question about dates, but could not get it to work. Does anyone know how to do this?

Source: merging based on similar but not exact dates

df3 <- df1 %>%
left_join(df2, by=c("incident_date"), suffix 
= c(".df1", ".df2")) %>%
filter(abs({strptime(df1$incident_time,format="%H:%M:%S")}.df1 - 
{strptime(df2$incident_time,format="%H:%M:%S")}.df2) <= strptime("00:45:00",format="%H:%M:%S))

Recommended answer

I think the fuzzyjoin package is best for this.

I'll add a $tm (POSIXct) column to both frames, since that is needed to get a clear "calculated difference" (in seconds).

# %y parses a two-digit year ("21" -> 2021); %Y would read it as the year 0021
df1$tm <- as.POSIXct(paste(df1$date, df1$time), format="%m/%d/%y %H:%M:%S")
df2$tm <- as.POSIXct(paste(df2$date, df2$time), format="%m/%d/%y %H:%M:%S")
fuzzyjoin::difference_left_join(df1, df2, by = "tm", max_dist = 45*60)
#   ID.x   date.x   time.x                tm.x ID.y   date.y   time.y                tm.y
# 1    1 04/06/21 05:02:06 2021-04-06 05:02:06    1 04/06/21 05:12:06 2021-04-06 05:12:06
# 2    2 05/03/21 04:12:11 2021-05-03 04:12:11    2 05/03/21 04:08:11 2021-05-03 04:08:11
# 3    3 02/02/20 03:02:10 2020-02-02 03:02:10    3 02/02/20 03:09:10 2020-02-02 03:09:10
# 4    4 09/09/20 09:12:14 2020-09-09 09:12:14    4 09/09/20 09:12:14 2020-09-09 09:12:14
# 5    5 02/02/21 15:18:20 2021-02-02 15:18:20    5 02/02/21 15:18:20 2021-02-02 15:18:20
# 6    6 04/04/21 14:00:00 2021-04-04 14:00:00   NA     <NA>     <NA>                <NA>
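
To make the 45-minute tolerance concrete, here is a minimal base-R sketch of the comparison behind a difference join: every df1 timestamp is compared with every df2 timestamp, and a pair counts as a match when the gap is at most 45*60 seconds. It uses only three rows copied from the Data section. The helper `to_time` and the names `gap_secs` and `within_tol` are illustrative, and fixing `tz = "UTC"` is an assumption made to keep the arithmetic independent of the local time zone.

```r
# Three sample rows copied from the Data section (rows 1, 2 and 6).
df1 <- data.frame(ID = 1:3,
                  date = c("04/06/21", "05/03/21", "04/04/21"),
                  time = c("05:02:06", "04:12:11", "14:00:00"))
df2 <- data.frame(ID = 1:3,
                  date = c("04/06/21", "05/03/21", "04/04/21"),
                  time = c("05:12:06", "04:08:11", "15:00:00"))

# Hypothetical helper: combine date and time strings into POSIXct.
# %y reads two-digit years; tz = "UTC" is an assumption for reproducibility.
to_time <- function(d, t) {
  as.POSIXct(paste(d, t), format = "%m/%d/%y %H:%M:%S", tz = "UTC")
}

tm1 <- to_time(df1$date, df1$time)
tm2 <- to_time(df2$date, df2$time)

# Matrix of absolute gaps in seconds: rows index df1, columns index df2.
gap_secs <- abs(outer(as.numeric(tm1), as.numeric(tm2), "-"))

# TRUE where the pair lies within the 45-minute tolerance.
within_tol <- gap_secs <= 45 * 60
print(within_tol)
```

Converting with `as.numeric()` keeps the unit explicitly in seconds, whereas subtracting POSIXct values directly yields a difftime whose unit R chooses automatically.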

Obviously there's a bunch of name cleanup required. How about this:

library(dplyr)  # provides %>% and select()
fuzzyjoin::difference_left_join(df1, df2[,c("ID","tm")], by = "tm", max_dist = 45*60) %>%
  select(ID = ID.x, date, time, ID2 = ID.y)
#   ID     date     time ID2
# 1  1 04/06/21 05:02:06   1
# 2  2 05/03/21 04:12:11   2
# 3  3 02/02/20 03:02:10   3
# 4  4 09/09/20 09:12:14   4
# 5  5 02/02/21 15:18:20   5
# 6  6 04/04/21 14:00:00  NA
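
For comparison, the approach attempted in the question (join on the date, then keep pairs whose times are close) can be sketched in base R without any extra packages. This is only a sketch under stated assumptions: the names `to_time`, `m`, `gap`, `near` and `result` are illustrative, and `tz = "UTC"` is assumed for reproducibility. Because it joins on the date string first, it would miss pairs that straddle midnight (for example 23:50 on one day and 00:10 on the next), which joining on the full timestamp avoids.

```r
df1 <- data.frame(ID = 1:6,
                  date = c("04/06/21", "05/03/21", "02/02/20", "09/09/20", "02/02/21", "04/04/21"),
                  time = c("05:02:06", "04:12:11", "03:02:10", "09:12:14", "15:18:20", "14:00:00"))
df2 <- data.frame(ID = 1:6,
                  date = c("04/06/21", "05/03/21", "02/02/20", "09/09/20", "02/02/21", "04/04/21"),
                  time = c("05:12:06", "04:08:11", "03:09:10", "09:12:14", "15:18:20", "15:00:00"))

# Hypothetical helper: build a POSIXct from date and time strings.
to_time <- function(d, t) {
  as.POSIXct(paste(d, t), format = "%m/%d/%y %H:%M:%S", tz = "UTC")
}

# Step 1: exact join on the date only; clashing columns get suffixes.
m <- merge(df1, df2, by = "date", suffixes = c(".df1", ".df2"))

# Step 2: absolute time gap in seconds for every candidate pair.
gap <- abs(as.numeric(difftime(to_time(m$date, m$time.df1),
                               to_time(m$date, m$time.df2),
                               units = "secs")))

# Step 3: keep pairs within 45 minutes, then re-attach unmatched df1 rows.
near <- m[gap <= 45 * 60, c("ID.df1", "ID.df2")]
result <- merge(df1, near, by.x = "ID", by.y = "ID.df1", all.x = TRUE)
print(result)
```

With the sample data, row 6 is the only df1 row left without a partner, since its candidate in df2 is a full hour away.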

Note: it is possible to find multiple matches (if multiple incidents are within 45 minutes), so you may need to add a grouped-filter:

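... %>%
  group_by(ID.x) %>%
  # filter() needs a logical vector, so an index from which.min() does not work there;
  # slice_min() keeps the single closest match per df1 row instead
  slice_min(abs(as.numeric(tm.x) - as.numeric(tm.y)), n = 1, with_ties = FALSE) %>%
  ungroup()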

(This needs to be done before my renaming and removal of the tm.* fields.)
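
The closest-match idea can also be illustrated in base R on a small, made-up join result. In this sketch the `joined` frame is hypothetical: the second row (ID.y = 7) is invented to create a duplicate match for ID.x = 1, and the last row stands for a df1 row with no match. Sorting by the gap within each ID.x and keeping the first row per ID.x retains the nearest match, while unmatched rows (NA gap) survive because `order()` puts NA last.

```r
# Hypothetical output of a difference join: ID.x = 1 has two candidate matches,
# ID.x = 3 has none (NA on the right-hand side).
joined <- data.frame(
  ID.x = c(1L, 1L, 2L, 3L),
  tm.x = as.POSIXct(c("2021-04-06 05:02:06", "2021-04-06 05:02:06",
                      "2021-05-03 04:12:11", "2021-04-04 14:00:00"), tz = "UTC"),
  ID.y = c(1L, 7L, 2L, NA),
  tm.y = as.POSIXct(c("2021-04-06 05:12:06", "2021-04-06 05:05:06",
                      "2021-05-03 04:08:11", NA), tz = "UTC")
)

# Absolute gap in seconds; NA where there is no match.
joined$gap <- abs(as.numeric(joined$tm.x) - as.numeric(joined$tm.y))

# Sort by ID.x, then by gap (NA sorts last), and keep the first row per ID.x.
sorted <- joined[order(joined$ID.x, joined$gap), ]
closest <- sorted[!duplicated(sorted$ID.x), ]
print(closest)
```

Here ID.x = 1 ends up paired with the invented ID.y = 7, which is 3 minutes away rather than 10.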

Data

df1 <- structure(list(ID = 1:6, date = c("04/06/21", "05/03/21", "02/02/20", "09/09/20", "02/02/21", "04/04/21"), time = c("05:02:06", "04:12:11", "03:02:10", "09:12:14", "15:18:20", "14:00:00")), class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(ID = 1:6, date = c("04/06/21", "05/03/21", "02/02/20", "09/09/20", "02/02/21", "04/04/21"), time = c("05:12:06", "04:08:11", "03:09:10", "09:12:14", "15:18:20", "15:00:00")), class = "data.frame", row.names = c(NA, -6L))
