How to find dates that overlap from two different dataframes and subset

Problem description

I would like to use each date in dataframe A to select the rows in dataframe B that have a matching ID and a date within 180 days of that date.

For example:

Dataframe A
ID  Date A
42  2012-07-21
42  2013-04-12
167 2009-04-27
167 2010-04-19
105 2010-12-16
105 2012-01-05


Dataframe B
ID Date B
12 2016-09-08
35 2008-02-02
42 2012-01-09
42 2013-03-13
167 2010-08-02
105 2010-11-26
105 2011-08-12
105 2011-11-11
105 2013-03-15
105 2013-09-13

I would like to create a dataframe that provides the closest combination of dates while ensuring that there are at least 3 Date B values in the sequence. Date A is the reference date: the first Date B needs to be within +/- 180 days of Date A and must be followed by at least two subsequent Date B values. If there are two or more potential Date A and Date B combinations, I would prefer the combination that preserves a minimum of 3 Date B values.

ID  Date A        Date B
105 2012-01-05    2011-11-11
105 2012-01-05    2013-03-15
105 2012-01-05    2013-09-13


Recommended answer

If you have big data, I would suggest using data.table's rolling join instead.

Suppose these are your data sets:

dfa <- read.table(text = "ID  Date
                  42  '2012-07-21'
                  42  '2013-04-12'", header = TRUE)

dfb <- read.table(text = "ID Date
                  12 '2016-09-08'
                  35 '2008-02-02'
                  42 '2012-01-09'
                  42 '2013-03-13'", header = TRUE)

We will convert them to data.tables and convert the Date column to the IDate class:

library(data.table) #1.9.8+
setDT(dfa)[, Date := as.IDate(Date)]
setDT(dfb)[, Date := as.IDate(Date)]
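As a quick check on why IDate is convenient here, the following minimal sketch (not part of the original answer) shows that IDate values are stored as integers and that adding or subtracting an integer shifts a date by that many days. The variable names d_a, d_b, lower, and gap are illustrative only.

```r
library(data.table)

d_a <- as.IDate("2012-07-21")  # a date from dfa (ID 42)
d_b <- as.IDate("2012-01-09")  # a date from dfb with the same ID

typeof(d_a)                     # IDate is integer-backed
lower <- d_a - 180L             # start of the 180-day window before d_a
gap   <- as.integer(d_a - d_b)  # distance between the two dates, in days
gap <= 180L                     # is d_b inside the window?
```

Here gap is 194 days, so this pair falls outside the 180-day window, which is why the first row of dfa finds no partner in the joins that follow.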

Then simply join away (you can do the rolling join both ways):

# You can perform another rolling join with `roll = -180` too
indx <- dfb[
            dfa, # For each row in dfa, find a match in dfb
            on = .(ID, Date), # The columns to join by
            roll = 180, # Rolling window, can join again on -180 afterwards
            which = TRUE, # Return the row index within `dfb` that was matched
            mult = "first", # Multiple match handling: take only the first match
            nomatch = 0L # Don't return unmatched indexes (NAs)
           ]

dfb[indx]
#    ID       Date
# 1: 42 2013-03-13
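The code comment above mentions running a second rolling join with `roll = -180`. A minimal sketch of how the two directions might be combined is shown below; the names back and fwd and the use of sort(unique(...)) to merge the two index vectors are assumptions made for illustration, not part of the original answer.

```r
library(data.table)

dfa <- data.table(ID = c(42L, 42L),
                  Date = as.IDate(c("2012-07-21", "2013-04-12")))
dfb <- data.table(ID = c(12L, 35L, 42L, 42L),
                  Date = as.IDate(c("2016-09-08", "2008-02-02",
                                    "2012-01-09", "2013-03-13")))

# roll = 180: match the latest dfb date that is at most 180 days BEFORE
# (or equal to) each dfa date
back <- dfb[dfa, on = .(ID, Date), roll = 180, which = TRUE, nomatch = 0L]

# roll = -180: match the earliest dfb date that is at most 180 days AFTER
# (or equal to) each dfa date
fwd <- dfb[dfa, on = .(ID, Date), roll = -180, which = TRUE, nomatch = 0L]

# Merge both directions into one set of dfb row indexes
indx <- sort(unique(c(back, fwd)))
dfb[indx]
```

Note that each rolling join returns at most one dfb row per dfa row and direction (the nearest one), so this approach finds the closest dates rather than every date inside the window.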






An alternative way of achieving this is to use data.table's non-equi join feature on manually created Date - 180 and Date + 180 columns:

# Create range columns
dfa[, c("Date_m_180", "Date_p_180") := .(Date - 180L, Date + 180L)]

# Join away
indx <- dfb[dfa, 
            on = .(ID, Date >= Date_m_180, Date <= Date_p_180), 
            which = TRUE, 
            mult = "first",
            nomatch = 0L]
dfb[indx]
#    ID       Date
# 1: 42 2013-03-13
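The join above returns only the matched rows of dfb. If the goal is a table that pairs every Date A with every Date B inside its window, as the question's desired output suggests, one possible sketch is to drop `mult = "first"` and select the columns explicitly with the `x.` and `i.` prefixes. This runs on the full example data from the question; the names pairs, Date_A, and Date_B are assumptions for illustration, and the sketch does not enforce the "at least 3 Date B values" requirement.

```r
library(data.table)

dfa <- data.table(
  ID   = c(42L, 42L, 167L, 167L, 105L, 105L),
  Date = as.IDate(c("2012-07-21", "2013-04-12", "2009-04-27",
                    "2010-04-19", "2010-12-16", "2012-01-05")))
dfb <- data.table(
  ID   = c(12L, 35L, 42L, 42L, 167L, 105L, 105L, 105L, 105L, 105L),
  Date = as.IDate(c("2016-09-08", "2008-02-02", "2012-01-09", "2013-03-13",
                    "2010-08-02", "2010-11-26", "2011-08-12", "2011-11-11",
                    "2013-03-15", "2013-09-13")))

# Window bounds around each Date A
dfa[, c("Date_m_180", "Date_p_180") := .(Date - 180L, Date + 180L)]

# Keep ALL matches; x.Date is the date from dfb, i.Date the date from dfa
pairs <- dfb[dfa,
             on = .(ID, Date >= Date_m_180, Date <= Date_p_180),
             .(ID, Date_A = i.Date, Date_B = x.Date),
             nomatch = 0L]
pairs
```

Selecting `x.Date` explicitly matters here: in a non-equi join, the plain `Date` column of the result holds the bound taken from dfa rather than the original date from dfb.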

Both methods should handle large data sets almost instantly.
