R - Speeding up approximate date match. idata.frame?
Problem description
I am struggling to efficiently perform a "close" date match between two data frames. This question explores a solution using idata.frame from the plyr package, but I would be very happy with other suggested solutions as well.
Here is a very simplistic version of the two data frames:
sampleticker<-data.frame(cbind(ticker=c("A","A","AA","AA"),
date=c("2005-1-25","2005-03-30","2005-02-15","2005-04-21")))
sampleticker$date<-as.Date(sampleticker$date,format="%Y-%m-%d")
samplereport<-data.frame(cbind(ticker=c("A","A","A","AA","AA","AA"),
rdate=c("2005-2-15","2005-03-15","2005-04-15",
"2005-03-01","2005-04-20","2005-05-01")))
samplereport$rdate<-as.Date(samplereport$rdate,format="%Y-%m-%d")
In the actual data, sampleticker has over 30,000 rows and 40 columns, and samplereport has almost 300,000 rows and 25 columns.
What I would like to do is merge the two data frames so that each row in sampleticker is combined with the closest date match in samplereport that occurs AFTER the date in sampleticker. I have solved similar problems in the past by doing a simple merge on the ticker field, sorting ascending, and then selecting unique combinations of ticker and date. However, due to the size of this dataset, the merge blows up extremely quickly.
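For reference, the merge-then-deduplicate approach described above might look like the following minimal sketch in base R. The sample data is rebuilt with plain data.frame() calls so the block runs on its own, and the names candidates and closest are illustrative. Note that the intermediate merge holds one row per (ticker row, report row) pair for each ticker, which is why it grows so quickly on the real data.

```r
# Sample data rebuilt here so the block is self-contained
sampleticker <- data.frame(
  ticker = c("A", "A", "AA", "AA"),
  date   = as.Date(c("2005-01-25", "2005-03-30", "2005-02-15", "2005-04-21")),
  stringsAsFactors = FALSE
)
samplereport <- data.frame(
  ticker = c("A", "A", "A", "AA", "AA", "AA"),
  rdate  = as.Date(c("2005-02-15", "2005-03-15", "2005-04-15",
                     "2005-03-01", "2005-04-20", "2005-05-01")),
  stringsAsFactors = FALSE
)

# Pair every ticker row with every report row for the same ticker
candidates <- merge(sampleticker, samplereport, by = "ticker")

# Keep only reports dated strictly after the ticker date
candidates <- candidates[candidates$rdate > candidates$date, ]

# Sort so the earliest qualifying report comes first within each (ticker, date)
candidates <- candidates[order(candidates$ticker, candidates$date, candidates$rdate), ]

# Keep the first row of each (ticker, date) combination
closest <- candidates[!duplicated(candidates[, c("ticker", "date")]), ]
closest
```

Ticker rows with no later report drop out of the result in this sketch, because the filter removes all of their candidate pairs.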
As near as I can tell, merge does not allow this sort of approximate matching. I have seen some solutions which use findInterval, but since the distance between the dates will vary, I am not sure that I can specify an interval that will work for all rows.
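On the findInterval concern: findInterval takes a sorted vector of break points rather than a fixed interval width, so the report dates themselves can serve as the breaks. The following is only a sketch under that reading, in base R. The helper next_report is hypothetical (not part of any package), and the sample data is rebuilt so the block runs on its own.

```r
# Sample data rebuilt here so the block is self-contained
sampleticker <- data.frame(
  ticker = c("A", "A", "AA", "AA"),
  date   = as.Date(c("2005-01-25", "2005-03-30", "2005-02-15", "2005-04-21")),
  stringsAsFactors = FALSE
)
samplereport <- data.frame(
  ticker = c("A", "A", "A", "AA", "AA", "AA"),
  rdate  = as.Date(c("2005-02-15", "2005-03-15", "2005-04-15",
                     "2005-03-01", "2005-04-20", "2005-05-01")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each date, the first report date strictly after it.
# findInterval() counts how many sorted rdates are <= each date, so the next
# position is the first rdate that is > the date. Positions past the end of
# rdates index to NA, meaning no later report exists.
next_report <- function(dates, rdates) {
  rdates <- sort(rdates)
  idx <- findInterval(as.numeric(dates), as.numeric(rdates)) + 1L
  rdates[idx]
}

# Split the report dates once per ticker instead of subsetting for every row
reports_by_ticker <- split(samplereport$rdate, samplereport$ticker)

result <- sampleticker
result$rdate <- as.Date(NA)
for (tk in unique(result$ticker)) {
  r <- reports_by_ticker[[tk]]
  if (is.null(r)) next  # ticker has no reports at all; leave NA
  rows <- result$ticker == tk
  result$rdate[rows] <- next_report(result$date[rows], r)
}
result
```

This only attaches rdate. Pulling in the other report columns would need the matched row index rather than the date itself.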
Following another post here, I have written the following code, which uses adply on each row to perform the join:
library(plyr)
# For each row of sampleticker, keep the earliest report dated after that row's date
merge<-adply(sampleticker,1,function(x){
  y<-subset(samplereport,ticker %in% x$ticker & rdate > x$date)
  y[which.min(y$rdate),]
})
This works quite nicely: for the sample data, I get the output below, which is what I want.
        date ticker      rdate
1 2005-01-25      A 2005-02-15
2 2005-03-30      A 2005-04-15
3 2005-02-15     AA 2005-03-01
4 2005-04-21     AA 2005-05-01
However, since the code performs 30,000+ subsetting operations, it is extremely slow: I ran the above query for more than a day before finally killing it.
I see here that plyr 1.0 has a structure, idata.frame, which calls the data frame by reference, dramatically speeding up the subsetting operation. However, I cannot get the following code to work:
isamplereport<-idata.frame(samplereport)
adply(sampleticker,1,function(x){
y<-subset(isamplereport,isamplereport$ticker %in% x$ticker &
isamplereport$rdate > x$date)
y[which.min(y$rdate),]
})
I get the error:
Error in list_to_dataframe(res, attr(.data, "split_labels")) :
Results must be all atomic, or all data frames
This makes sense to me, since the operation returns an idata.frame (I assume). However, changing the last line to:
as.data.frame(y[which.min(y$rdate),])
also throws an error:
Error in `[.data.frame`(x$`_data`, x$`_rows`, x$`_cols`) :
undefined columns selected.
Note that calling as.data.frame on the plain old samplereport returns the original data frame, as expected.
I know that idata.frame is experimental, so I didn't necessarily expect it to work properly. However, if anyone has an idea on how to fix this, I would appreciate it. Alternatively, if anyone could suggest a completely different approach that runs more efficiently, that would be fantastic.
Matt
UPDATE: data.table is the right way to go about this. See below.
Recommended answer
Thanks to Matthew Dowle and his addition of the ability to roll backwards as well as forwards in data.table, it is now much simpler to perform this merge.
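library(data.table)
ST <- data.table(sampleticker)
SR <- data.table(samplereport)
setkey(ST,ticker,date)
# Copy rdate into a separate join column: in the result, the join column holds ST's dates
SR[,mergerdate:=rdate]
setkey(SR,ticker,mergerdate)
# roll=-Inf rolls the next report date backwards onto each ST date
merge<-SR[ST,roll=-Inf]
setnames(merge,"mergerdate","date")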
#    ticker       date      rdate
# 1:      A 2005-01-25 2005-02-15
# 2:      A 2005-03-30 2005-04-15
# 3:     AA 2005-02-15 2005-03-01
# 4:     AA 2005-04-21 2005-05-01