R - Speeding up approximate date match. idata.frame?

Problem description

I am struggling to efficiently perform a "close" date match between two data frames. This question explores a solution using idata.frame from the plyr package, but I would be very happy with other suggested solutions as well.

Here is a very simplified version of the two data frames:

sampleticker <- data.frame(
  ticker = c("A", "A", "AA", "AA"),
  date = as.Date(c("2005-1-25", "2005-03-30", "2005-02-15", "2005-04-21"),
                 format = "%Y-%m-%d"))

samplereport <- data.frame(
  ticker = c("A", "A", "A", "AA", "AA", "AA"),
  rdate = as.Date(c("2005-2-15", "2005-03-15", "2005-04-15",
                    "2005-03-01", "2005-04-20", "2005-05-01"),
                  format = "%Y-%m-%d"))

In the actual data, sampleticker has over 30,000 rows and 40 columns, and samplereport has almost 300,000 rows and 25 columns.

What I would like to do is merge the two data frames so that each row in sampleticker is combined with the closest date match in samplereport that occurs AFTER the date in sampleticker. I have solved similar problems in the past by doing a simple merge on the ticker field, sorting ascending, and then selecting unique combinations of ticker and date. However, due to the size of this dataset, the intermediate merge result blows up extremely quickly.

As near as I can tell, merge does not allow this sort of approximate matching. I have seen some solutions that use findInterval, but since the distance between the dates will vary, I am not sure that I can specify an interval that will work for all rows.
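On the findInterval concern: findInterval takes a sorted vector of breakpoints rather than a fixed interval width, so varying gaps between dates should not be a problem. The following is a minimal base R sketch, assuming the sample data from this question and looping over tickers (not rows); the variable names `tk`, `rows`, `r`, and `idx` are illustrative only, and this has not been benchmarked on the full data set.

```r
sampleticker <- data.frame(
  ticker = c("A", "A", "AA", "AA"),
  date = as.Date(c("2005-01-25", "2005-03-30", "2005-02-15", "2005-04-21")))
samplereport <- data.frame(
  ticker = c("A", "A", "A", "AA", "AA", "AA"),
  rdate = as.Date(c("2005-02-15", "2005-03-15", "2005-04-15",
                    "2005-03-01", "2005-04-20", "2005-05-01")))

# Start with a missing report date for every row.
sampleticker$rdate <- as.Date(NA)

for (tk in unique(sampleticker$ticker)) {
  rows <- sampleticker$ticker == tk
  # Sorted report dates for this ticker act as the breakpoints.
  r <- sort(samplereport$rdate[samplereport$ticker == tk])
  # findInterval counts how many report dates are <= each date,
  # so the element one past that count is the first strictly later report.
  idx <- findInterval(as.numeric(sampleticker$date[rows]), as.numeric(r)) + 1
  # Indexing past the end of r yields NA when no later report exists.
  sampleticker$rdate[rows] <- r[idx]
}

print(sampleticker)
```

Because the lookup is vectorized within each ticker, the number of loop iterations is the number of distinct tickers rather than the number of rows.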

Following another post here, I have written the following code to use adply on each row and to perform the join:

library(plyr)
# For each row of sampleticker, keep the earliest later report for that ticker.
merged <- adply(sampleticker, 1, function(x) {
  y <- subset(samplereport, ticker %in% x$ticker & rdate > x$date)
  y[which.min(y$rdate), ]
})

This works quite nicely: for the sample data, I get the result below, which is what I want.

   date       ticker      rdate
 1 2005-01-25  A          2005-02-15
 2 2005-03-30  A          2005-04-15
 3 2005-02-15  AA         2005-03-01
 4 2005-04-21  AA         2005-05-01

However, since the code performs 30,000+ subsetting operations, it is extremely slow: I ran the above query for more than a day before finally killing it.

I see here that plyr 1.0 has a structure, idata.frame, which accesses the data frame by reference, dramatically speeding up the subsetting operation. However, I cannot get the following code to work:

isamplereport<-idata.frame(samplereport)
adply(sampleticker,1,function(x){
  y<-subset(isamplereport,isamplereport$ticker %in% x$ticker & 
    isamplereport$rdate > x$date)
  y[which.min(y$rdate),]
})

I get the error:

Error in list_to_dataframe(res, attr(.data, "split_labels")) : 
Results must be all atomic, or all data frames

This makes sense to me, since the operation returns an idata.frame (I assume). However, changing the last line to:

as.data.frame(y[which.min(y$rdate),]) 

also raises an error:

Error in `[.data.frame`(x$`_data`, x$`_rows`, x$`_cols`) : 
undefined columns selected.

Note that calling as.data.frame on the plain old samplereport returns the original data frame, as expected.

I know that idata.frame is experimental, so I didn't necessarily expect it to work properly. However, if anyone has an idea on how to fix this, I would appreciate it. Alternatively, if anyone could suggest a completely different approach that runs more efficiently, that would be fantastic.

Matt

UPDATE: data.table is the right way to go about this. See below.

Recommended answer

Thanks to Matthew Dowle and his addition of the ability to roll backwards as well as forwards in data.table, it is now much simpler to perform this merge.

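library(data.table)

ST <- data.table(sampleticker)
SR <- data.table(samplereport)
setkey(ST, ticker, date)
# Copy rdate into the join column so the original rdate survives the join.
SR[, mergerdate := rdate]
setkey(SR, ticker, mergerdate)
# roll = -Inf rolls backwards: each ST date takes the next report date on or after it.
merged <- SR[ST, roll = -Inf]
setnames(merged, "mergerdate", "date")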

#    ticker       date      rdate
# 1:      A 2005-01-25 2005-02-15
# 2:      A 2005-03-30 2005-04-15
# 3:     AA 2005-02-15 2005-03-01
# 4:     AA 2005-04-21 2005-05-01
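One caveat worth checking against your own data: a rolling join with roll = -Inf also accepts an exact match, so a report dated on the same day as the ticker date would be selected, whereas the question asks for reports strictly AFTER it. The sketch below is one possible way to enforce strictness, assuming daily dates and a data.table version that supports `on=` together with `roll`; the helper column `lookup` is a hypothetical name introduced here, not part of the original answer.

```r
library(data.table)

ST <- data.table(
  ticker = c("A", "A", "AA", "AA"),
  date = as.Date(c("2005-01-25", "2005-03-30", "2005-02-15", "2005-04-21")))
SR <- data.table(
  ticker = c("A", "A", "A", "AA", "AA", "AA"),
  rdate = as.Date(c("2005-02-15", "2005-03-15", "2005-04-15",
                    "2005-03-01", "2005-04-20", "2005-05-01")))

# Look up from the day after each ticker date, so a same-day report cannot match.
ST[, lookup := date + 1]
# Join column on the report side; rdate itself is kept as a value column.
SR[, lookup := rdate]

# Rolling join on the last 'on' column: take the next report on or after 'lookup'.
res <- SR[ST, on = .(ticker, lookup), roll = -Inf]
res[, lookup := NULL]

print(res)
```

Rows in `res` follow the order of `ST`, and `rdate` is NA for any ticker date with no later report.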
