Evaluate at which size data.table is faster than data.frame


Question


Can someone please help me evaluate the data frame size at which data.table becomes faster for searches? In my use case the data frames will have 24,000 rows and 560,000 rows. Blocks of 40 rows are always singled out for further use.


Example: DF is a data frame with 120 rows, 7 columns (x1 to x7); "string" occupies the first 40 rows of x1.


DF2 is 1000 times DF => 120,000 rows


For the size of DF, data.table is slower; for the size of DF2, it is faster.

Code:
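The construction of DF and DF2 is not shown in the question; here is a minimal sketch that matches the description above (the non-matching x1 strings and the filler values for x2 to x7 are assumptions):

library(data.table)
library(microbenchmark)

set.seed(42)
# 120 rows, 7 columns (x1 to x7); "string" fills the first 40 rows of x1
DF <- data.frame(
  x1 = c(rep("string", 40), paste0("other", 1:80)),   # assumed filler ids
  matrix(rnorm(120 * 6), ncol = 6,
         dimnames = list(NULL, paste0("x", 2:7))),    # assumed numeric columns
  stringsAsFactors = FALSE
)
# DF2 is DF stacked 1000 times => 120,000 rows
DF2 <- DF[rep(seq_len(nrow(DF)), 1000), ]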

> DT <- data.table(DF)
> setkey(DT, x1)
> 
> DT2 <- data.table(DF2)
> setkey(DT2, x1)
> 
> microbenchmark(DF[DF$x1=="string", ], unit="us")
Unit: microseconds
                    expr     min       lq   median       uq     max neval
 DF[DF$x1 == "string", ] 282.578 290.8895 297.0005 304.5785 2394.09   100
> microbenchmark(DT[.("string")], unit="us")
Unit: microseconds
            expr      min       lq  median      uq      max neval
 DT[.("string")] 1473.512 1500.889 1536.09 1709.89 6727.113   100
> 
> 
> microbenchmark(DF2[DF2$x1=="string", ], unit="us")
Unit: microseconds
                      expr     min       lq   median       uq      max neval
 DF2[DF2$x1 == "string", ] 31090.4 34694.74 35537.58 36567.18 61230.41   100
> microbenchmark(DT2[.("string")], unit="us")
Unit: microseconds
             expr      min       lq   median       uq      max neval
 DT2[.("string")] 1327.334 1350.801 1391.134 1457.378 8440.668   100


Answer

library(microbenchmark)
library(data.table)

# Benchmark lookups at increasing sizes: at step n the table has
# 40 * 2^n rows (2^n distinct ids, 40 rows per id).
timings <- sapply(1:10, function(n) {
  DF <- data.frame(id  = rep(as.character(seq_len(2^n)), each = 40),
                   val = rnorm(40 * 2^n),
                   stringsAsFactors = FALSE)
  DT <- data.table(DF, key = "id")   # keyed copy: enables binary search
  tofind <- unique(DF$id)[n - 1]     # id to look up (zero-length when n == 1)
  print(microbenchmark(DF[DF$id == tofind, ],                 # data.frame vector scan
                       DT[DT$id == tofind, ],                 # vector scan on the data.table
                       DT[id == tofind],                      # within-frame call
                       `[.data.frame`(DT, DT$id == tofind, ), # data.frame method applied to DT
                       DT[tofind]),                           # keyed binary search
        unit = "ns")$median
})

# Median timing (log10 nanoseconds) vs. n, one line per expression
matplot(1:10, log10(t(timings)), type = "l",
        xlab = "log2(n)", ylab = "log10(median (ns))", lty = 1)
legend("topleft",
       legend = c("DF[DF$id == tofind, ]",
                  "DT[DT$id == tofind, ]",
                  "DT[id == tofind]",
                  "`[.data.frame`(DT, DT$id == tofind, )",
                  "DT[tofind]"),
       col = 1:5, lty = 1)


data.table has had a few updates since this was written (a bit more overhead has been added to [.data.table as more arguments and robustness checks have been built in, but auto-indexing has also been introduced). Here's an updated version as of the January 13, 2016 build of 1.9.7 from GitHub.


The main innovation is that the third option, DT[id == tofind], now leverages auto-indexing. The main conclusion remains the same: if your table is of any nontrivial size (roughly more than 500 observations), data.table's within-frame call is faster.
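To see what auto-indexing does, here is a minimal sketch (assuming data.table 1.9.4 or later with the default datatable.auto.index option; indices() is available in recent versions): the first == filter on an unkeyed column builds an index as a side effect, and subsequent filters on that column reuse it for a binary search.

library(data.table)

DT <- data.table(id  = rep(as.character(1:1000), each = 40),
                 val = rnorm(40000))   # note: no key set

indices(DT)                 # NULL: no index yet
res1 <- DT[id == "500"]     # first call builds an index on 'id' as a side effect
indices(DT)                 # "id": later == filters on id reuse the index
res2 <- DT[id == "500"]     # faster now: binary search instead of a vector scan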


(Notes about the updated plot: a few minor things changed (the y-axis is no longer logged, times are expressed in microseconds, the x-axis labels changed, and a title was added), but one non-trivial change is that I updated the microbenchmark calls to stabilize the estimates; namely, I set the times argument to as.integer(1e5/2^n).)
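For concreteness, the benchmark call inside the sapply loop above would then look roughly like this (a sketch; only the times argument and the display unit differ from the earlier code, and n and tofind come from the enclosing loop):

print(microbenchmark(DF[DF$id == tofind, ],
                     DT[DT$id == tofind, ],
                     DT[id == tofind],
                     `[.data.frame`(DT, DT$id == tofind, ),
                     DT[tofind],
                     times = as.integer(1e5 / 2^n)),  # more repetitions at small n
      unit = "us")$median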

