Join two data.tables on date, with closest date in table 1 strictly less than date in second table

Problem description

Stealing a dummy example from elsewhere on SO (Join data.table on exact date or if not the case on the nearest less than date), I'm looking to join two tables so that each date in Dt2 is matched to the closest date in Dt1 that is strictly earlier.

I also turned off the 'warning' message from the 'slide' function (reminder = F) in the DataCombine solution, as it was probably unfairly slowing down mtoto's solution.

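library(data.table)

# Rows are comma-separated, so the header and sep must match
Dt1 <- read.table(text="
date,x
1/26/2010,10
1/25/2010,9
1/24/2010,9
1/22/2010,7
1/19/2010,11", header=TRUE, sep=",", stringsAsFactors=FALSE)

Dt2 <- read.table(text="
date
1/26/2010
1/23/2010
1/20/2010", header=TRUE, stringsAsFactors=FALSE)

# Parse the dates so they compare chronologically and support date arithmetic
Dt1$date <- as.IDate(Dt1$date, format="%m/%d/%Y")
Dt2$date <- as.IDate(Dt2$date, format="%m/%d/%Y")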

Desired join result

   date     x  
1/26/2010 - 9 # based on closest observation strictly less than date  
1/23/2010 - 7   
1/20/2010 - 11


Timings of two solutions

(I keep the data.frame format as input to mtoto's solution, and data.table as input to jangorecki's.)

library(DataCombine) # provides slide()

# Assumed setup (not shown in the original post): data.frame copies for
# mtoto's solution, data.table versions for jangorecki's
Df1 <- as.data.frame(Dt1); Df2 <- as.data.frame(Dt2)
setDT(Dt1); setDT(Dt2)

solution.mtoto = function(Df1, Df2)
{
  # Full outer join of two df's
  merged <- merge(Df1, Df2, by = "date", all = TRUE, sort = TRUE)

  # Shifting values backwards by one using 'slide' from DataCombine
  merged <- slide(merged, Var = "x", slideBy = -1, reminder = FALSE)

  # Inner join retaining the relevant cols
  return(merge(Df2, merged)[,-2])
}

solution.jangorecki = function(Dt1, Dt2)
{
  offset.roll.join = function(Dt1, Dt2){
    Dt2[, jndate := date - 1L] # produce join column with offset
    on.exit(Dt2[, jndate := NULL]) # cleanup join col on exit
    Dt1[Dt2, .(date = i.date, x), on = c("date" = "jndate"), roll = Inf] # do rolling join
  }
  return(offset.roll.join(Dt1, Dt2))
}

res.mtoto = sapply(1:10, FUN = function(x){system.time({solution.mtoto(Df1, Df2)})})

res.jangorecki = sapply(1:10, FUN = function(x){system.time({solution.jangorecki(Dt1, Dt2)})})


> res.mtoto[c("user.self", "sys.self"),]
           [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
user.self 0.004 0.004 0.004 0.004 0.003 0.003 0.003 0.003 0.003 0.003
sys.self  0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000

> res.jangorecki[c("user.self", "sys.self"),]
           [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
user.self 0.005 0.005 0.004 0.004 0.005 0.004 0.004 0.004 0.003 0.004
sys.self  0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
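The timings above use tables of five and three rows, so they mostly measure call overhead. A rough sketch for timing the rolling join on larger inputs follows. The data is synthetic, and the sizes, the seed and the names big1/big2 are assumptions made purely for illustration. No result is claimed here, since timings depend on the machine.

```r
library(data.table)

set.seed(1)                                   # arbitrary seed, for repeatability only
n    <- 1e5                                   # hypothetical table size
days <- as.IDate("2000-01-01") + 0:(5 * n)    # pool of candidate days

big1 <- data.table(date = sort(sample(days, n)), x = rnorm(n))
big2 <- data.table(date = sort(sample(days, n)))

# Same offset rolling join as in the small example: look up (date - 1 day)
# and carry the last earlier observation forward
timing <- system.time(
  res <- big1[big2[, .(date, jndate = date - 1L)],
              .(date = i.date, x),
              on = c("date" = "jndate"),
              roll = Inf]
)
print(timing)
```

Rows of big2 that precede every date in big1 get NA for x, because there is nothing earlier to roll forward.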

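Edit: I accidentally referenced Dt1 instead of Df1 in mtoto's solution. Now fixed.

Similar speeds (the difference would probably be more obvious on a larger dataset?). My other issue is that I would also like the result to include the matched date, that is, the closest earlier date from the first table.

For example, the desired result would be: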

date - x - date2
1/26/2010 - 9 - 1/25/2010
1/23/2010 - 7 - 1/22/2010
1/20/2010 - 11 - 1/19/2010

Recommended answer

Rolling join with -1L offset.
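To see why the offset matters, here is a minimal sketch comparing a plain rolling join with the shifted one. It assumes whole-day dates stored as IDate, so that subtracting 1L turns "on or before" into "strictly before". The table contents mirror the question's example, and the names plain and strict are illustrative only.

```r
library(data.table)

Dt1 <- data.table(
  date = as.IDate(c("2010-01-26", "2010-01-25", "2010-01-24", "2010-01-22", "2010-01-19")),
  x    = c(10L, 9L, 9L, 7L, 11L)
)
Dt2 <- data.table(date = as.IDate(c("2010-01-26", "2010-01-23", "2010-01-20")))

# Plain rolling join: an exact match on the same day is accepted
plain <- Dt1[Dt2, on = "date", roll = Inf]

# Shifted rolling join: look up the day before, so a same-day row can never match
Dt2[, jndate := date - 1L]
strict <- Dt1[Dt2, .(date = i.date, x), on = c("date" = "jndate"), roll = Inf]
Dt2[, jndate := NULL]  # drop the helper column again

plain$x   # 10 7 11 (2010-01-26 matches itself)
strict$x  # 9 7 11  (2010-01-26 falls back to 2010-01-25)
```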

Update 2016-04-02: With this commit in the current devel version, v1.9.7, this can be done without copying Dt1's date into a temporary column. From NEWS:

x's columns can be referred to in j using the prefix x. at all times. This is particularly useful when it is necessary to refer to x's column that is also a join column. This is a patch addressing #1615.

Dt2[, jndate := date - 1L]
Dt1[Dt2,
    .(date = i.date, orgdate = x.date, x),
    on = c("date" = "jndate"),
    roll = Inf]
#         date    orgdate  x
#1: 2010-01-26 2010-01-25  9
#2: 2010-01-23 2010-01-22  7
#3: 2010-01-20 2010-01-19 11
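Because := modifies by reference, Dt2 keeps the jndate column after the join above. One possible variation, offered as a sketch rather than as part of the original answer, builds the shifted column inline in i so that Dt2 is left untouched. It assumes a data.table version that supports the x. prefix in j.

```r
library(data.table)

Dt1 <- data.table(
  date = as.IDate(c("2010-01-26", "2010-01-25", "2010-01-24", "2010-01-22", "2010-01-19")),
  x    = c(10L, 9L, 9L, 7L, 11L)
)
Dt2 <- data.table(date = as.IDate(c("2010-01-26", "2010-01-23", "2010-01-20")))

# i is a temporary table holding the original date and the shifted join date;
# Dt2 itself is never modified
res <- Dt1[Dt2[, .(date, jndate = date - 1L)],
           .(date = i.date, orgdate = x.date, x),
           on = c("date" = "jndate"),
           roll = Inf]

names(Dt2)  # "date"
```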


Original answer, useful if you are on 1.9.6 or older.

library(data.table)

# data
Dt1 = fread("date,x
1/26/2010,10
1/25/2010,9
1/24/2010,9
1/22/2010,7
1/19/2010,11")[, date := as.IDate(date, format = "%m/%d/%Y")][]
Dt2 = fread("date
1/26/2010
1/23/2010
1/20/2010")[, date := as.IDate(date, format = "%m/%d/%Y")][]

# solution
offset.roll.join = function(Dt1, Dt2){
    Dt2[, jndate := date - 1L] # produce join column with offset
    Dt1[, orgdate := date] # should not be needed after data.table#1615
    on.exit({Dt2[, jndate := NULL]; Dt1[, orgdate := NULL]}) # cleanup on exit
    Dt1[Dt2, .(date = i.date, orgdate, x), on = c("date" = "jndate"), roll = Inf] # do rolling join
}
offset.roll.join(Dt1, Dt2)
#         date    orgdate  x
#1: 2010-01-26 2010-01-25  9
#2: 2010-01-23 2010-01-22  7
#3: 2010-01-20 2010-01-19 11
