Time in getting single elements from data.table and data.frame objects
Question
In my work I have several tables (customer details, transaction records, etc.). Since some of them are very big (millions of rows), I recently switched to the data.table package (thanks Matthew). However, some of them are quite small (a few hundred rows and 4 or 5 columns) and are accessed many times. I therefore started to think about the [.data.table overhead in retrieving data, rather than in set()ting values, which is already clearly described in ?set: regardless of the size of the table, one item is set in around 2 microseconds (depending on the CPU).
However, there doesn't seem to be an equivalent of set for getting a value from a data.table when the exact row and column are known: a sort of loopable [.data.table.
library(data.table)
library(microbenchmark)
m = matrix(1,nrow=100000,ncol=100)
DF = as.data.frame(m)
DT = as.data.table(m) # same data used in ?set
> microbenchmark(DF[3450,1] , DT[3450, V1], times=1000) # much more overhead in DT
Unit: microseconds
expr min lq median uq max neval
DF[3450, 1] 32.745 36.166 40.5645 43.497 193.533 1000
DT[3450, V1] 788.791 803.453 813.2270 832.287 5826.982 1000
> microbenchmark(DF$V1[3450], DT[3450, 1, with=F], times=1000) # using atomic vector and
# removing part of DT overhead
Unit: microseconds
expr min lq median uq max neval
DF$V1[3450] 2.933 3.910 5.865 6.354 36.166 1000
DT[3450, 1, with = F] 297.629 303.494 305.938 309.359 1878.632 1000
> microbenchmark(DF$V1[3450], DT$V1[3450], times=1000) # using only atomic vectors
Unit: microseconds
expr min lq median uq max neval
DF$V1[3450] 2.933 2.933 3.421 3.422 40.565 1000 # DF seems still a bit faster (23%)
DT$V1[3450] 3.910 3.911 4.399 4.399 16.128 1000
The last method is indeed the best way to retrieve a single element quickly many times. However, set is even faster:
> microbenchmark(set(DT,1L,1L,5L), times=1000)
Unit: microseconds
expr min lq median uq max neval
set(DT, 1L, 1L, 5L) 1.955 1.956 2.444 2.444 24.926 1000
The question is: if we can set a value in 2.444 microseconds, shouldn't it be possible to get a value in a smaller (or at least similar) amount of time? Thanks.
Adding two more options, as suggested:
> microbenchmark(`[.data.frame`(DT,3450,1), DT[["V1"]][3450], times=1000)
Unit: microseconds
expr min lq median uq max neval
`[.data.frame`(DT, 3450, 1) 46.428 47.895 48.383 48.872 2165.509 1000
DT[["V1"]][3450] 20.038 21.504 23.459 24.437 116.316 1000
Unfortunately, these are not faster than the previous attempts.
Recommended answer
Thanks to @hadley, we have the solution!
> microbenchmark(DT$V1[3450], set(DT,1L,1L,5L), .subset2(DT, "V1")[3450], times=1000, unit="us")
Unit: microseconds
expr min lq median uq max neval
DT$V1[3450] 2.566 3.208 3.208 3.528 27.582 1000
set(DT, 1L, 1L, 5L) 1.604 1.925 1.925 2.246 15.074 1000
.subset2(DT, "V1")[3450] 0.000 0.321 0.322 0.642 8.339 1000
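For anyone wondering why this wins: .subset2() is base R's non-dispatching form of `[[`, so it skips the S3 method lookup that `$` and `[[` go through on a data.table (or data.frame). Below is a minimal sketch of how it could be wrapped for use in a loop. Note that get_elem is a hypothetical helper name I made up for illustration, not something data.table provides, and the table here is smaller than the one in the question just to keep the example light.

```r
library(data.table)

# Same kind of data as in the question, only smaller.
m  <- matrix(1, nrow = 10000, ncol = 10)
DT <- as.data.table(m)   # columns are named V1 ... V10

# get_elem is a hypothetical helper (not part of data.table).
# .subset2(x, j) extracts column j without method dispatch,
# then the plain vector is indexed by row i.
get_elem <- function(x, i, j) {
  .subset2(x, j)[i]
}

set(DT, 3450L, 1L, 5)                # write one cell by reference
val <- get_elem(DT, 3450L, "V1")     # read the same cell back

# If many rows of the same column are needed, pulling the column
# out once avoids repeating even the .subset2() call.
v1    <- .subset2(DT, "V1")
total <- 0
for (i in 1:100) total <- total + v1[i]
```

The column can be given by name or by position, the same as with `[[`. Wrapping the call in a function adds a little call overhead of its own, so in a really tight loop the inline .subset2(DT, "V1")[i] form from the benchmark above is probably the one to use.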