R fast single item lookup from list vs data.table vs hash

Question

One of the problems I often face is needing to look up an arbitrary row from a data.table. I ran into a problem yesterday where I was trying to speed up a loop and using profvis I found that the lookup from the data.table was the most costly part of the loop. I then decided to try and find the fastest way to do a single item lookup in R.

The data often takes the form of a data.table with a key column of the character type. The remaining columns are typically numeric values. I tried to create a random table with similar characteristics to what I often deal with which means >100K rows. I compared the native list, data.table package and the hash package. The native list and data.table were comparable for individual item lookup performance. Hash appeared to be two orders of magnitude faster. The tests were made up of 10 sets of 10,000 keys randomly sampled to provide for variance in access behavior. Each lookup method used the same sets of keys.

Ultimately my preference would be to either get the row lookup for data.table to be faster instead of having to create a hash table of my data or establish that it cannot be done and just use the hash package when I have to do fast lookup. I don't know if it would be possible but could you create a hash table of references to the rows in the data.table to allow for fast lookup using the hash package? I know that type of thing is possible in C++ but to my knowledge R does not allow this kind of thing due to the lack of pointers.

To Summarize: 1) Am I using data.table correctly for the lookups and therefore this is the speed I should expect for a single row lookup? 2) Would it be possible to create a hash of pointers to the data.table rows to allow for fast lookup that way?

Windows 10 Pro x64
R 3.2.2
data.table 1.9.6
hash 2.2.6
Intel Core i7-5600U with 16 GB RAM

library(microbenchmarkCore) # install.packages("microbenchmarkCore", repos="http://olafmersmann.github.io/drat")
library(data.table)
library(hash)

# Set seed to 42 to ensure repeatability
set.seed(42)

# Setting up test ------

# Generate product ids
product_ids <- as.vector(
  outer(LETTERS[seq(1, 26, 1)],
    outer(outer(LETTERS[seq(1, 26, 1)], LETTERS[seq(1, 26, 1)], paste, sep=""),
          LETTERS[seq(1, 26, 1)], paste, sep = ""
    ), paste, sep = ""
  )
)

# Create test lookup data
test_lookup_list <- lapply(product_ids, function(id){
  return_list <- list(
    product_id = id,
    val_1 = rnorm(1),
    val_2 = rnorm(1),
    val_3 = rnorm(1),
    val_4 = rnorm(1),
    val_5 = rnorm(1),
    val_6 = rnorm(1),
    val_7 = rnorm(1),
    val_8 = rnorm(1)
  )
  return(return_list)
})

# Set names of items in list
names(test_lookup_list) <- sapply(test_lookup_list, function(elem) elem[['product_id']])

# Create lookup hash
lookup_hash <- hash(names(test_lookup_list), test_lookup_list)

# Create data.table from list and set key of data.table to product_id field
test_lookup_dt <- rbindlist(test_lookup_list)
setkey(test_lookup_dt, product_id)

test_lookup_env <- list2env(test_lookup_list)

# Generate sample of keys to be used for speed testing
lookup_tests <- lapply(1:10, function(x){
  lookups <- sample(test_lookup_dt$product_id, 10000)
  return(lookups)
})

# Native list timing
native_list_timings <- sapply(lookup_tests, function(lookups){
  timing <- system.nanotime(
    for(lookup in lookups){
      return_value <- test_lookup_list[[lookup]]
    }    
  )
  return(timing[['elapsed']])
})

# Data.table timing
datatable_timings <- sapply(lookup_tests, function(lookups){
  timing <- system.nanotime(
    for(lookup in lookups){
      return_value <- test_lookup_dt[lookup]
    }
  )
  return(timing[['elapsed']])
})


# Hashtable timing
hashtable_timings <- sapply(lookup_tests, function(lookups){
  timing <- system.nanotime(
    for(lookup in lookups){
      return_value <- lookup_hash[[lookup]]
    }
  )
  return(timing[['elapsed']])
})

# Environment timing
environment_timings <- sapply(lookup_tests, function(lookups){
  timing <- system.nanotime(
    for(lookup in lookups){
      return_value <- test_lookup_env[[lookup]]
    }
  )
  return(timing[['elapsed']])
})

# Summary of timing results
summary(native_list_timings)
summary(datatable_timings)
summary(hashtable_timings)
summary(environment_timings)

Here are the results:

> # Summary of timing results
> summary(native_list_timings)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  35.12   36.20   37.28   37.05   37.71   39.24 
> summary(datatable_timings)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  49.13   51.51   52.64   52.76   54.39   55.13 
> summary(hashtable_timings)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1588  0.1857  0.2107  0.2213  0.2409  0.3258 
> summary(environment_timings)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.09322 0.09524 0.10680 0.11850 0.13760 0.17140 

It appears that the hash lookup is approximately two orders of magnitude faster than either the native list or data.table in this particular scenario.

I received feedback from Neal Fultz suggesting the use of the native Environment object. Here is the code and result I got:

test_lookup_env <- list2env(test_lookup_list)
# Environment timing
environment_timings <- sapply(lookup_tests, function(lookups){
  timing <- system.nanotime(
    for(lookup in lookups){
      return_value <- test_lookup_env[[lookup]]
    }
  )
  return(timing[['elapsed']])
})
summary(environment_timings)
> summary(environment_timings)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.09322 0.09524 0.10680 0.11850 0.13760 0.17140 

Indeed, it appears that an environment is faster for individual item access in this scenario. Thank you, Neal Fultz, for pointing this method out; I appreciate having a more thorough understanding of the object types available in R. My questions still stand: am I using data.table correctly (I expect so, but I am open to critique), and is there a way to access individual rows of a data.table using some kind of pointer magic that would provide faster single-row access?

There have been some mentions that my access pattern in the innermost loop of my test is inefficient. I agree. What I am trying to do is emulate as closely as possible the situation I am dealing with. The loop this actually occurs in does not allow for vectorization, which is why I am not using it. I realize this is not strictly the 'R' way of doing things. The data.table in my code provides reference information, and I do not necessarily know which row I need until I am inside the loop, which is why I am trying to figure out how to access an individual item as quickly as possible, preferably with the data still stored in a data.table. This is also in part a curiosity question: can it be done?

I received feedback from @jangorecki that using Sys.time() is an ineffective means of measuring the performance of a function. I have since revised the code to use system.nanotime() per the suggestion. The original code and timing results above have been updated accordingly.

The question still stands: is this the fastest way to do a row lookup from a data.table, and if so, is it possible to create a hash of pointers to the rows for quick lookup? At this point I am mostly curious about how far R can be pushed. As someone who came from C++, this is a fun challenge.

I accepted the answer provided by Neal Fultz because it discussed what I actually wanted to know. That said, this is not the way data.table was intended to be used, so no one should interpret this to mean data.table is slow; it is actually incredibly fast. This was a very particular use case that I was curious about. My data comes in as a data.table, and I wanted to know if I could get quick row access while leaving it as a data.table. I also wanted to compare data.table access speed with a hash table, which is what is often used for fast, non-vectorized item lookup.

Answer

For a non-vectorized access pattern, you might want to try the built-in environment objects:

require(microbenchmark)

test_lookup_env <- list2env(test_lookup_list)


x <- lookup_tests[[1]][1]
microbenchmark(
    lookup_hash[[x]],
    test_lookup_list[[x]],
    test_lookup_dt[x],
    test_lookup_env[[x]]
)

Here you can see it's even zippier than hash:

Unit: microseconds
                  expr      min        lq       mean    median        uq      max neval
      lookup_hash[[x]]   10.767   12.9070   22.67245   23.2915   26.1710   68.654   100
 test_lookup_list[[x]]  847.700  853.2545  887.55680  863.0060  893.8925 1369.395   100
     test_lookup_dt[x] 2652.023 2711.9405 2771.06400 2758.8310 2803.9945 3373.273   100
  test_lookup_env[[x]]    1.588    1.9450    4.61595    2.5255    6.6430   27.977   100

Stepping through data.table:::`[.data.table` is instructive in showing why you are seeing dt slow down. When you index with a character and a key is set, it does quite a bit of bookkeeping, then drops down into bmerge, which is a binary search. Binary search is O(log n) and will get slower as n increases.
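
You can watch this bookkeeping yourself. A minimal sketch, assuming the test_lookup_dt built above (the exact verbose output varies across data.table versions):

# Verbose mode prints diagnostic output describing the steps a keyed
# character lookup goes through, including the drop into bmerge.
options(datatable.verbose = TRUE)
test_lookup_dt["AAAA"]   # "AAAA" is one of the generated product ids
options(datatable.verbose = FALSE)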

Environments, on the other hand, use hashing (by default) and have constant access time with respect to n.
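
Incidentally, the hashing can be made explicit when building the environment, and the hash table can be pre-sized to the number of keys. A minimal sketch, reusing test_lookup_list from above:

# new.env() hashes by default; making it explicit and pre-sizing the
# hash table avoids rehashing while the environment is populated.
test_lookup_env <- list2env(
  test_lookup_list,
  envir = new.env(hash = TRUE, size = length(test_lookup_list))
)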

To work around this, you can manually build a map and index through it:

x <- lookup_tests[[2]][2]

e <- list2env(setNames(as.list(1:nrow(test_lookup_dt)), test_lookup_dt$product_id))

#example access:
test_lookup_dt[e[[x]], ]

However, seeing so much bookkeeping code in the data.table method, I'd try out plain old data.frames as well:

test_lookup_df <- as.data.frame(test_lookup_dt)

rownames(test_lookup_df) <- test_lookup_df$product_id

If we are really paranoid, we could skip the [ methods altogether and lapply over the columns directly.
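
Concretely, that looks like the following (a sketch; note it returns the row as a plain named list rather than a one-row data.frame):

# Bypass `[.data.frame` dispatch entirely: treat the data.frame as a
# list of column vectors and pull one element from each by row index.
row_as_list <- lapply(test_lookup_df, `[`, e[[x]])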

Here are some more timings (from a different machine than above):

> microbenchmark(
+   test_lookup_dt[x,],
+   test_lookup_dt[x],
+   test_lookup_dt[e[[x]],],
+   test_lookup_df[x,],
+   test_lookup_df[e[[x]],],
+   lapply(test_lookup_df, `[`, e[[x]]),
+   lapply(test_lookup_dt, `[`, e[[x]]),
+   lookup_hash[[x]]
+ )
Unit: microseconds
                                expr       min         lq        mean     median         uq       max neval
                 test_lookup_dt[x, ]  1658.585  1688.9495  1992.57340  1758.4085  2466.7120  2895.592   100
                   test_lookup_dt[x]  1652.181  1695.1660  2019.12934  1764.8710  2487.9910  2934.832   100
            test_lookup_dt[e[[x]], ]  1040.869  1123.0320  1356.49050  1280.6670  1390.1075  2247.503   100
                 test_lookup_df[x, ] 17355.734 17538.6355 18325.74549 17676.3340 17987.6635 41450.080   100
            test_lookup_df[e[[x]], ]   128.749   151.0940   190.74834   174.1320   218.6080   366.122   100
 lapply(test_lookup_df, `[`, e[[x]])    18.913    25.0925    44.53464    35.2175    53.6835   146.944   100
 lapply(test_lookup_dt, `[`, e[[x]])    37.483    50.4990    94.87546    81.2200   124.1325   241.637   100
                    lookup_hash[[x]]     6.534    15.3085    39.88912    49.8245    55.5680   145.552   100

Overall, to answer your questions, you are not using data.table "wrong" but you are also not using it in the way it was intended (vectorized access). However, you can manually build a map to index through and get most of the performance back.
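
For contrast, a minimal sketch of the vectorized access data.table is designed for, assuming the test objects built above: a single keyed join retrieves all 10,000 rows in one call, amortizing the per-call bookkeeping across the whole batch.

# One vectorized keyed join instead of 10,000 single-row lookups;
# the result is a data.table with one row per requested product_id.
lookups <- lookup_tests[[1]]
batch   <- test_lookup_dt[.(lookups)]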
