Benchmarking data.frame (base), data.frame (package dataframe) and data.table


Question

With the recent introduction of the package dataframe, I thought it was time to properly benchmark the various data structures and to highlight what each is best at. I'm no expert on the different strengths of each, so my question is: how should we go about benchmarking them?

Some (rather crude) things I have tried:

library(microbenchmark)
library(data.table)
mat <- matrix(rnorm(10000), nrow = 100)
mat2df.base <- data.frame(mat)
library(dataframe)
mat2df.dataframe <- data.frame(mat)
mat2dt <- data.table(mat)
bm <- microbenchmark(t(mat), t(mat2df.base), t(mat2df.dataframe), t(mat2dt), times = 1000)

Results:

Unit: microseconds
                 expr      min       lq   median       uq       max
1              t(mat)   20.927   23.210   31.201   36.908   951.591
2      t(mat2df.base)  929.903  974.039  997.439 1040.814 28270.717
3 t(mat2df.dataframe)  924.957  969.093  992.683 1025.404 27255.205
4           t(mat2dt) 1749.465 1817.382 1857.903 1909.649  5347.321
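One thing worth keeping in mind when reading these timings: for a data.frame, t() goes through a matrix conversion and returns a matrix, so the data.frame rows above include that conversion cost. The following is a minimal base R sketch that checks this relationship on a small example. The object names m and df are made up for illustration, and this is a sanity check rather than a benchmark.

```r
# Small illustrative objects; the names are hypothetical, not from the post.
m  <- matrix(as.numeric(1:6), nrow = 2)
df <- data.frame(m)

# t() on a data.frame returns a matrix, not a data.frame.
t_df <- t(df)
is.matrix(t_df)

# The result agrees with transposing the explicit matrix conversion.
same_as_matrix_route <- isTRUE(all.equal(t_df, t(as.matrix(df))))
same_as_matrix_route
```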

Solution

I'm no data.table expert, but from what I understand its primary advantage is in indexing. So try subsetting with the various packages to compare speeds.
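As a small illustration of what "indexing" means here, the sketch below sets a key on a data.table and looks up one key value, then checks that the result matches ordinary logical subsetting. It assumes the data.table package is installed; the column names grp and x and the size n are made up for this example. Recent data.table versions may also optimise the `==` form internally, so timing differences can vary by version.

```r
library(data.table)

set.seed(42)
n   <- 1e5                                             # hypothetical size
grp <- as.character(sample(1:10, n, replace = TRUE))   # hypothetical key column
dt  <- data.table(grp = grp, x = rnorm(n))

# Sort by grp and mark it as the key, which enables binary-search lookups.
setkey(dt, grp)

# Keyed lookup: a character value in i is matched against the key column.
keyed <- dt["2"]

# Ordinary logical subsetting on the same column, for comparison.
scanned <- dt[grp == "2"]

# Both approaches should select the same rows.
nrow(keyed) == nrow(scanned)
```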

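library(microbenchmark)
library(data.table)

# 1e6 rows x 10 numeric columns, plus a character key with values "1".."10"
mat <- matrix(rnorm(1e7), ncol = 10)
key <- as.character(sample(1:10, 1e6, replace = TRUE))
mat2df.base <- data.frame(mat)
mat2df.base$key <- key

# Time base subsetting before the dataframe package is loaded
bm.before <- microbenchmark(
  mat2df.base[mat2df.base$key == 2, ]
)

library(dataframe)
mat2df.dataframe <- data.frame(mat)
mat2df.dataframe$key <- key
mat2dt <- data.table(mat)
mat2dt$key <- key
setkey(mat2dt, key)  # sort by key so that lookups can use binary search

# Compare logical subsetting on the data.frames with a keyed data.table lookup
bm.subset <- microbenchmark(
  mat2df.base[mat2df.base$key == 2, ],
  mat2df.dataframe[mat2df.dataframe$key == 2, ],
  mat2dt["2", ]
)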

Unit: milliseconds
                                           expr       min        lq    median       uq       max
1           mat2df.base[mat2df.base$key == 2, ] 153.99596 154.98602 155.91621 157.0894 194.24456
2 mat2df.dataframe[mat2df.dataframe$key == 2, ] 153.63907 154.66295 155.68553 156.9827 173.76913
3                                 mat2dt["2", ]  15.51085  15.66742  15.72899  15.8463  22.53044

With a sufficiently large matrix, data.table wipes the floor with the other options: in the timings above, the keyed lookup is roughly ten times faster than either data.frame subset.

Also, I suspect that @RJ-'s attempt to compare the performance of base data.frame with the data.frames from the package dataframe is not working as intended. The timings are just too similar, and I suspect that once the library is loaded, both results come from the package's code rather than from base.
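One way to look into a suspicion like this is to ask R where the functions being called are actually defined. The sketch below uses only base and utils functions; the helper name defined_in is made up for illustration. In a session where no package overrides these functions, each call should report "base"; what it reports after loading dataframe is exactly the thing to check, and is not asserted here.

```r
# Report the environment a function was defined in (helper name is hypothetical).
defined_in <- function(f) environmentName(environment(f))

# Which data.frame constructor does the search path currently resolve to?
defined_in(data.frame)

# The base version can always be requested explicitly with the :: operator.
defined_in(base::data.frame)

# Same check for the subsetting method that the benchmark exercises.
defined_in(utils::getS3method("[", "data.frame"))
```

Calling base::data.frame(mat) explicitly is one option for building the base object regardless of what else is attached, although it does not control which subsetting method is dispatched later.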

Edit: Tested. It doesn't seem to make much of a difference. bm.before was run before library(dataframe) was loaded; bm.after is the same code as bm.subset above, just run in the same session as bm.before to provide an accurate comparison.

bm.before <- microbenchmark( 
  mat2df.base[mat2df.base$key==2,] 
)

> bm.after
Unit: milliseconds
                                           expr       min        lq    median        uq       max
1           mat2df.base[mat2df.base$key == 2, ] 160.62708 166.25787 167.52325 169.18710 173.47864
2 mat2df.dataframe[mat2df.dataframe$key == 2, ] 163.30259 166.00588 167.80138 169.24647 174.05713
3                                 mat2dt["2", ]  16.16117  16.89627  17.09047  17.37057  62.01954

> bm.before
Unit: milliseconds
                                 expr     min       lq   median       uq      max
1 mat2df.base[mat2df.base$key == 2, ] 159.178 160.9867 162.1149 164.0046 195.9501
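To put numbers on the comparison, the medians reported above can be turned into ratios. This is plain arithmetic on the printed values, not a new measurement; the variable names are made up for illustration.

```r
# Median timings in milliseconds, copied from the bm.after and bm.before output.
after_median  <- c(base = 167.52325, dataframe = 167.80138, data.table = 17.09047)
before_median <- c(base = 162.1149)

# How many times slower than the keyed data.table lookup each approach is.
slowdown <- after_median / after_median[["data.table"]]
round(slowdown, 1)

# Relative change in the base timing after loading dataframe, in percent.
change_pct <- 100 * (after_median[["base"]] / before_median[["base"]] - 1)
round(change_pct, 1)
```

On these figures both data.frame variants come out at roughly 9.8 times the data.table median, and the base timing shifts by only about 3 percent between the two runs, which fits the "not much of a difference" reading.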
