Speeding up searching for indices within a Large R Data Frame
Question
This may look like an innocuously simple problem but it takes a very long time to execute. Any ideas for speeding it up or vectorization etc. would be greatly appreciated.
I have an R data frame with 5 million rows and 50 columns: OriginalDataFrame

A list of indices from that frame: IndexList (55000 [numIndex] unique indices)
It's a time series, so there are ~5 million rows for 55K unique indices.
The OriginalDataFrame has been ordered by dataIndex. Not all of the indices in IndexList are present in OriginalDataFrame. The task is to find the indices that are present, and construct a new data frame: FinalDataFrame
Currently I am running this code using library(foreach):
FinalDataFrame <- foreach (i = 1:numIndex, .combine = "rbind") %dopar% {
  OriginalDataFrame[(OriginalDataFrame$dataIndex == IndexList[i]), ]
}
I run this on a machine with 24 cores and 128GB RAM and yet this takes around 6 hours to complete.
Am I doing something exceedingly silly, or are there better ways in R to do this?
Answer
Here's a little benchmark comparing data.table to data.frame. If you know the special data table invocation for this case, it's about 7x faster, ignoring the cost of setting up the index (which is relatively small, and would typically be amortised across multiple calls). If you don't know the special syntax, it's only a little faster. (Note the problem size is a little smaller than the original to make it easier to explore.)
library(data.table)
library(microbenchmark)
options(digits = 3)

# Regular data frame
df <- data.frame(id = 1:1e5, x = runif(1e5), y = runif(1e5))

# Data table, with index
dt <- data.table(df)
setkey(dt, "id")

ids <- sample(1e5, 1e4)

microbenchmark(
  df[df$id %in% ids, ],     # won't preserve order
  df[match(ids, df$id), ],
  dt[id %in% ids, ],
  dt[match(ids, id), ],
  dt[.(ids)]
)
# Unit: milliseconds
#                     expr   min    lq median    uq   max neval
#     df[df$id %in% ids, ] 13.61 13.99  14.69 17.26 53.81   100
#  df[match(ids, df$id), ] 16.62 17.03  17.36 18.10 21.22   100
#        dt[id %in% ids, ]  7.72  7.99   8.35  9.23 12.18   100
#     dt[match(ids, id), ] 16.44 17.03  17.36 17.77 61.57   100
#               dt[.(ids)]  1.93  2.16   2.27  2.43  5.77   100
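Mapped back onto the question, the fast `dt[.(ids)]` form would replace the whole foreach/rbind loop with a single keyed subset. A minimal sketch (not from the original answer; the toy data here stands in for the real 5M-row OriginalDataFrame and 55K-entry IndexList):

```r
library(data.table)

set.seed(1)
# Toy stand-ins for the question's objects
OriginalDataFrame <- data.frame(dataIndex = sample(100, 1000, replace = TRUE),
                                x = runif(1000))
IndexList <- sample(150, 50)   # some of these indices are absent on purpose

DT <- as.data.table(OriginalDataFrame)
setkey(DT, dataIndex)          # sort once; later lookups use the key

# One keyed join replaces the foreach/rbind loop.
# nomatch = 0L drops IndexList entries with no matching rows.
FinalDataFrame <- DT[.(IndexList), nomatch = 0L]
```

Since OriginalDataFrame is already ordered by dataIndex, the one-off setkey cost is mostly the sort check rather than a full re-sort.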
I had originally thought you might also be able to do this with rownames, which I thought built up a hash table and did the indexing efficiently. But that's obviously not the case:
df2 <- df
rownames(df2) <- as.character(df$id)

microbenchmark(
  df[df$id %in% ids, ],       # won't preserve order
  df2[as.character(ids), ],
  times = 1
)
# Unit: milliseconds
#                      expr    min     lq median     uq    max neval
#      df[df$id %in% ids, ]   15.3   15.3   15.3   15.3   15.3     1
#  df2[as.character(ids), ] 3609.8 3609.8 3609.8 3609.8 3609.8     1
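If you do want a genuinely hashed lookup in base R, environments are hash-backed, so an environment mapping id to row number gives constant-time lookups. A hedged sketch (the names `idx_env`, `rows`, `result` are illustrative, not from the original answer):

```r
df <- data.frame(id = 1:1e4, x = runif(1e4))
ids <- sample(1e4, 1e3)

# Build a hash-backed map from id (as character) to row number
idx_env <- new.env(hash = TRUE, size = nrow(df))
for (i in seq_len(nrow(df))) assign(as.character(df$id[i]), i, envir = idx_env)

# mget looks up all keys at once; the result preserves the order of ids
rows <- unlist(mget(as.character(ids), envir = idx_env), use.names = FALSE)
result <- df[rows, ]
```

The build loop is itself slow in R, so this only pays off when the same map is reused across many lookups; for one-off subsetting, match() or a keyed data.table is simpler and faster.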