Speeding up searching for indices within a Large R Data Frame


Problem Description

This may look like an innocuously simple problem, but it takes a very long time to execute. Any ideas for speeding it up, vectorizing it, etc. would be greatly appreciated.

I have an R data frame with 5 million rows and 50 columns: OriginalDataFrame.

A list of indices from that frame: IndexList (55,000 [numIndex] unique indices).

It's a time series, so there are ~5 million rows spread across the 55K unique indices.

OriginalDataFrame has been ordered by dataIndex. Not all of the indices in IndexList are present in OriginalDataFrame. The task is to find the indices that are present and construct a new data frame: FinalDataFrame.

Currently I am running this code using library(foreach):

library(foreach)  # %dopar% also needs a registered parallel backend,
                  # e.g. doParallel::registerDoParallel(cores = 24)
FinalDataFrame <- foreach(i = 1:numIndex, .combine = "rbind") %dopar% {
  OriginalDataFrame[OriginalDataFrame$dataIndex == IndexList[i], ]
}

I run this on a machine with 24 cores and 128 GB of RAM, and yet it takes around 6 hours to complete.

Am I doing something exceedingly silly, or are there better ways in R to do this?
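
For what it's worth, the per-index == comparison plus rbind is the expensive pattern here: it scans all 5 million rows once for each of the 55K indices and repeatedly copies the growing result. A minimal vectorized sketch of the same subset (assuming IndexList is an atomic vector of index values rather than an R list) would be:

# One hashed membership test over the whole column, one subset, no rbind
FinalDataFrame <- OriginalDataFrame[OriginalDataFrame$dataIndex %in% IndexList, ]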

Recommended Answer

Here's a little benchmark comparing data.table to data.frame. If you know the special data.table invocation for this case, it's about 7x faster, ignoring the cost of setting up the index (which is relatively small, and would typically be amortised across multiple calls). If you don't know the special syntax, it's only a little faster. (Note the problem size is a little smaller than the original to make it easier to explore.)

library(data.table)
library(microbenchmark)
options(digits = 3)

# Regular data frame
df <- data.frame(id = 1:1e5, x = runif(1e5), y = runif(1e5))

# Data table, with index
dt <- data.table(df)
setkey(dt, "id")

ids <- sample(1e5, 1e4)

microbenchmark(
  df[df$id %in% ids, ], # won't preserve order
  df[match(ids, df$id), ],
  dt[id %in% ids, ],
  dt[match(ids, id), ],
  dt[.(ids)]
)
# Unit: milliseconds
#                     expr   min    lq median    uq   max neval
#     df[df$id %in% ids, ] 13.61 13.99  14.69 17.26 53.81   100
#  df[match(ids, df$id), ] 16.62 17.03  17.36 18.10 21.22   100
#        dt[id %in% ids, ]  7.72  7.99   8.35  9.23 12.18   100
#     dt[match(ids, id), ] 16.44 17.03  17.36 17.77 61.57   100
#               dt[.(ids)]  1.93  2.16   2.27  2.43  5.77   100
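
Applied back to the original problem, the fast form would look roughly like this (a sketch, assuming IndexList is an atomic vector; nomatch = 0L drops the indices that have no rows in OriginalDataFrame, which matters here since not all of them are present):

library(data.table)

# Convert once and key by the lookup column; this setup cost is
# amortised if you query the table repeatedly
DT <- as.data.table(OriginalDataFrame)
setkey(DT, dataIndex)

# Keyed join: binary search per index instead of a full-column scan
FinalDataFrame <- DT[.(IndexList), nomatch = 0L]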

I had originally thought you might also be able to do this with rownames, which I thought built up a hash table and did the indexing efficiently. But that's obviously not the case:

df2 <- df
rownames(df2) <- as.character(df$id)

microbenchmark(
  df[df$id %in% ids, ],     # won't preserve order
  df2[as.character(ids), ], # character row-name lookup
  times = 1
)
# Unit: milliseconds
#                     expr    min     lq median     uq    max neval
#     df[df$id %in% ids, ]   15.3   15.3   15.3   15.3   15.3     1
# df2[as.character(ids), ] 3609.8 3609.8 3609.8 3609.8 3609.8     1
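
The slowdown appears to be in [.data.frame's handling of character indices (historically it routes them through partial matching) rather than in the lookup itself; match() hashes its table, so a sketch like the following should stay in millisecond territory:

# Hypothetical workaround: hash-based match on the row names directly,
# bypassing [.data.frame's character-index path
df2[match(as.character(ids), rownames(df2)), ]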

