R - Join Dataframes using a key and then Approximate Dates


Problem description

Problem

I am trying to merge two dataframes using 3 ID columns (or 1 column, if I paste the 3 together), one of which is a datetime variable that can vary between the two dataframes by up to 1 second.

Background

I have two dataframes extracted from a library with transaction records. For some reason, the check-outs and the check-ins are recorded separately, without a unique "transaction ID" to match them. I'd like to match them. The "check-out" dataframe has a record for each item that was checked out, including the due date (when the item should be returned). The "check-in" dataframe has a record for each item that was checked in, including the due date. Unfortunately, I am having a hard time merging these dataframes together for two reasons:

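  1. There is no unique transaction ID to match the tables. (Why? I have no idea.)
  2. The "due date" field can differ by up to one second for the same transaction.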

The variation in due_date occurs seemingly at random, so there is no way to tell for which records the two due dates are equal and for which they differ by 1 second. Otherwise, I could just subtract (or add) a second to make them equal.

Data

Here is a sample of the data I am working with:

library(dplyr)
library(lubridate)

check_in <- tribble(
  ~ patron_id, ~item_id, ~checked_in, ~due_date,
    "A", "Z", "2018-04-16 07:00:00", "2018-04-16 08:00:00",
    "A", "Y", "2018-04-17 07:30:01", "2018-04-17 08:30:01",
    "B", "X", "2018-04-17 07:00:01", "2018-04-17 08:00:01",
    "B", "Z", "2018-04-17 08:00:01", "2018-04-17 09:00:01",
    "B", "Z", "2018-04-09 09:00:01", "2018-04-09 10:00:01",
    "C", "V", "2018-04-09 09:00:01", "2018-04-09 10:00:01",
    "C", "X", "2018-04-09 09:00:01", "2018-04-09 10:00:01")

check_out <- tribble(
  ~ patron_id, ~item_id, ~checked_out, ~due_date,
    "A", "Z", "2018-04-16 06:00:00", "2018-04-16 08:00:01",
    "A", "Y", "2018-04-17 06:30:01", "2018-04-17 08:30:00",
    "B", "X", "2018-04-17 06:00:01", "2018-04-17 08:00:00",
    "B", "Z", "2018-04-17 07:00:01", "2018-04-17 09:00:00",
    "B", "Z", "2018-04-09 08:00:01", "2018-04-09 10:00:01",
    "C", "V", "2018-04-09 08:00:01", "2018-04-09 10:00:01",
    "C", "X", "2018-04-09 08:00:01", "2018-04-09 10:00:00")

# Parse the character columns as date-times (ymd_hms() is from lubridate)
check_in$due_date <- ymd_hms(check_in$due_date)
check_in$checked_in <- ymd_hms(check_in$checked_in)

check_out$due_date <- ymd_hms(check_out$due_date)
check_out$checked_out <- ymd_hms(check_out$checked_out)

Patron ID is the unique ID of the person who checked out a book. The Item ID is the unique ID of the book. Checked Out is when the book was checked out. Checked In is when the book was checked in. And Due Date is when the book is due.

For this sample data, I made all of the due-dates equal to 2 hours after the check out date. I also made the check-in dates equal to 1 hour after the check out date.

Desired output

I would like to take the "checked_in" variable from the check_in dataframe and match it to the appropriate transaction in the check_out dataframe. The output would be something like this, but perhaps with some sort of generated transaction ID:

desired_output <- tribble(
  ~patron_id, ~item_id, ~checked_out, ~checked_in, ~due_date,
    "A", "Z", "2018-04-16 06:00:00", "2018-04-16 07:00:00", "2018-04-16 08:00:01",
    "A", "Y", "2018-04-17 06:30:01", "2018-04-17 07:30:01", "2018-04-17 08:30:00",
    "B", "X", "2018-04-17 06:00:01", "2018-04-17 07:00:01", "2018-04-17 08:00:00",
    "B", "Z", "2018-04-17 07:00:01", "2018-04-17 08:00:01", "2018-04-17 09:00:00",
    "B", "Z", "2018-04-09 08:00:01", "2018-04-09 09:00:01", "2018-04-09 10:00:01",
    "C", "V", "2018-04-09 08:00:01", "2018-04-09 09:00:01", "2018-04-09 10:00:01",
    "C", "X", "2018-04-09 08:00:01", "2018-04-09 09:00:01", "2018-04-09 10:00:00")

What I've tried

ATTEMPT #1:

I've tried to conditionally merge, as explained in this post, with the following modifications:

check_out <- check_out %>%
             mutate(transaction_id = paste(patron_id,"-",item_id,sep=""))
check_in <- check_in %>%
              mutate(transaction_id = paste(patron_id,"-",item_id,sep=""))

output <- merge(check_out, check_in, by="transaction_id")[abs(difftime(check_out$due_date, check_in$due_date, units = "secs"))<=1,]

But this method doesn't handle identical transaction IDs (obviously) and creates more records than there actually are.
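To see where the extra records come from, here is a minimal base-R sketch that uses only the pasted keys from the sample data. The `out_row` and `in_row` columns are hypothetical placeholders standing in for the real columns. The repeated "B-Z" key matches every "B-Z" row on the other side, so the merge returns more rows than there are transactions. Note also that the logical index in the attempt above is computed from the original 7-row dataframes, so it does not line up with the rows of the merged result.

```r
# Keys as produced by paste(patron_id, "-", item_id, sep = "") on the sample data
keys <- c("A-Z", "A-Y", "B-X", "B-Z", "B-Z", "C-V", "C-X")

# Hypothetical stand-ins for the two dataframes: one row per transaction
check_out <- data.frame(transaction_id = keys, out_row = seq_along(keys))
check_in  <- data.frame(transaction_id = keys, in_row  = seq_along(keys))

merged <- merge(check_out, check_in, by = "transaction_id")

nrow(merged)                  # 9 rows, although there are only 7 transactions
table(merged$transaction_id)  # "B-Z" appears 4 times (2 x 2 combinations)
```

Any fix therefore has to apply the 1-second condition to the rows of the merged result itself, not to the original dataframes.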

ATTEMPT #2:

Reverting back to the original dataframes, I attempted the solution in this post, with the following modifications:

output <- cbind(check_out, check_in[ 
                  sapply(check_out$due_date, 
                    function(x) which.min(abs(difftime(x, check_in$due_date)))), ])

But this method does not consider the "transaction ID", or rather, the two key variables that I am using to create some sort of unique ID, and thus gets the output wrong.

Other unsuccessful attempts

  1. Fuzzy Joins as mentioned in this article. (And the other R based solutions mentioned.)
  2. This response, which uses filtering.

Unfortunately, I wasn't able to get these to work. I wasn't confident in how the methods were working and it didn't produce what I wanted. Most likely a user error, because it seems others were able to get similar things to work.

Thanks

Thank you in advance, if you are able to help me. I tend to use the tools provided by the Tidyverse, but I am open to using other tools and methods. I tried to make sure I did my due diligence when searching for other solutions, but if you find that I missed an important post, please mark this as duplicate and send that post my way.

Please let me know if I can provide any additional information or clarify any of the above details.

Recommended answer

This works with:

inner_join(check_in, check_out, by = c("patron_id", "item_id")) %>%
  filter(abs(difftime(due_date.y, due_date.x, units= "secs"))<=as.difftime(1, format = "%S", units = "secs"))

Explanation: simple join + filtering the rows with a time difference <= 1 second
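As a fuller illustration of the same join-then-filter idea, here is a minimal sketch on a three-row subset of the question's sample data (including the repeated B/Z pair). The `_out`/`_in` suffixes, the `matched` name, and the generated `transaction_id` column are assumptions added for illustration, not part of the original answer. Recent dplyr versions may warn about a many-to-many relationship in the join; that is expected here, since the filter removes the cross-matches afterwards.

```r
library(dplyr)
library(lubridate)

# Subset of the sample data; patron B checks out item Z twice
check_out <- tribble(
  ~patron_id, ~item_id, ~checked_out, ~due_date,
  "A", "Z", "2018-04-16 06:00:00", "2018-04-16 08:00:01",
  "B", "Z", "2018-04-17 07:00:01", "2018-04-17 09:00:00",
  "B", "Z", "2018-04-09 08:00:01", "2018-04-09 10:00:01"
) %>%
  mutate(checked_out = ymd_hms(checked_out), due_date = ymd_hms(due_date))

check_in <- tribble(
  ~patron_id, ~item_id, ~checked_in, ~due_date,
  "A", "Z", "2018-04-16 07:00:00", "2018-04-16 08:00:00",
  "B", "Z", "2018-04-17 08:00:01", "2018-04-17 09:00:01",
  "B", "Z", "2018-04-09 09:00:01", "2018-04-09 10:00:01"
) %>%
  mutate(checked_in = ymd_hms(checked_in), due_date = ymd_hms(due_date))

matched <- inner_join(check_out, check_in,
                      by = c("patron_id", "item_id"),
                      suffix = c("_out", "_in")) %>%
  # Keep only pairs whose due dates agree to within 1 second
  filter(abs(as.numeric(difftime(due_date_out, due_date_in,
                                 units = "secs"))) <= 1) %>%
  # Generate a simple running ID, ordered by check-out time (an assumption)
  arrange(checked_out) %>%
  mutate(transaction_id = row_number()) %>%
  select(transaction_id, patron_id, item_id, checked_out, checked_in,
         due_date = due_date_out)

matched
```

One caveat with this approach: if the same patron borrowed the same item twice with due dates within one second of each other, the filter would still keep the cross-matches. That does not happen in the sample data.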
