Improving data.table subsetting performance

Published: 2020/9/20 19:04:04. Tags: r, performance, data.table, subset, benchmarking


Question

I am running a large Monte Carlo simulation, and I discovered that subsetting and searching my data is the slowest part of my code. To test some alternatives, I benchmarked performance with a data.frame, a data.table, and a matrix. Here is the benchmark code:

library(data.table)
# install.packages('profvis')
library(profvis)

# two integer columns with values 1..10, 10,000 rows
x.df = data.frame(a = sample(1:10, 10000, replace = TRUE), b = sample(1:10, 10000, replace = TRUE)) # set up a data.frame
x.dt = as.data.table(x.df) # a data.table copy
setkey(x.dt, a)            # sort by 'a' and set it as the key for faster searches
x.mat = as.matrix(x.df)    # a matrix

profvis({
for (i in 1:10000) {
  # test simple subsetting (the data.table call returns a one-column data.table, the others a vector)
  xsubset.mat = x.mat[100:200, 2]
  xsubset.df = x.df[100:200, 2]
  xsubset.dt = x.dt[100:200, 2]
  # test search performance
  # the matrix search now tests the matrix column itself, not x.df$a
  xsearch.mat = x.mat[which(x.mat[, "a"] == 10), 2]
  xsearch.df = x.df[which(x.df$a == 10), 2]
  xsearch.dt = x.dt[.(10), 2] # keyed lookup on 'a'
}
})

Here are my results:

In all seriousness, I love the compact syntax of data.table, and I am wondering if there is something I can do to improve its performance. According to the creators, it's supposed to be super fast. Am I using it incorrectly?

Recommended answer

After some more benchmarking, I now understand the issue. The fastest option depends on whether I'm doing many small searches or one big search. It seems that data.table has a lot of fixed overhead per search call, which makes it better suited to a few searches on one huge table than to many searches on small ones.
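If the per-call overhead is the bottleneck, one possible way to keep the data.table syntax is to batch several lookups into a single keyed join instead of issuing one call per value. The following is a minimal sketch under that assumption, not a measured recommendation; the vector keys and the variables looped, batched, and batched.split are hypothetical names introduced for illustration.

```r
library(data.table)

set.seed(1)
x.df <- data.frame(a = sample(1:10, 10000, replace = TRUE),
                   b = sample(1:10, 10000, replace = TRUE))
x.dt <- as.data.table(x.df)
setkey(x.dt, a)

keys <- c(3L, 7L, 10L)  # hypothetical values of 'a' to look up

# Many small calls: every x.dt[...] call pays the fixed per-call overhead.
looped <- lapply(keys, function(k) x.dt[.(k), b])

# One batched call: a single keyed join returns the rows for all keys at once.
batched <- x.dt[.(keys)]

# Split the batched result back into one vector of 'b' per key.
batched.split <- split(batched$b, batched$a)
```

Both approaches return the same values of b for each key; the batched form simply calls into data.table once rather than once per key. Whether this helps in a given simulation depends on whether the lookups can be collected before they are needed.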

Consider the following code, and compare with the original:

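# make a giant table, but search it only once
# (1e8 rows of two integer columns: each of the three objects takes roughly 800 MB)
x.df = data.frame(a = sample(1:10, 100000000, replace = TRUE), b = sample(1:10, 100000000, replace = TRUE))
x.dt = as.data.table(x.df)
setkey(x.dt, a)
x.mat = as.matrix(x.df)

profvis({
for (i in 1:1) {
  # simple subsetting
  xsubset.mat = x.mat[100:200, 2]
  xsubset.df = x.df[100:200, 2]
  xsubset.dt = x.dt[100:200, 2]

  # search; the matrix search now tests the matrix column itself, not x.df$a
  xsearch.mat = x.mat[which(x.mat[, "a"] == 10), 2]
  xsearch.df = x.df[which(x.df$a == 10), 2]
  xsearch.dt = x.dt[.(10), 2] # keyed lookup on 'a'
}
})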

Results:
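The profvis output is not reproduced here. As a rough cross-check that does not need profvis, the sketch below times repeated searches with base R's system.time. The table size n, the repetition count, and the helper time_reps are assumptions chosen so the example runs quickly; they are not the settings used above, and the timings will vary by machine.

```r
library(data.table)

set.seed(1)
n <- 1e5  # assumed size, much smaller than the tables above
x.df <- data.frame(a = sample(1:10, n, replace = TRUE),
                   b = sample(1:10, n, replace = TRUE))
x.dt <- as.data.table(x.df)
setkey(x.dt, a)

# Hypothetical helper: elapsed seconds for 'reps' calls of a zero-argument function.
time_reps <- function(f, reps) {
  system.time(for (i in seq_len(reps)) f())[["elapsed"]]
}

search.df <- function() x.df[which(x.df$a == 10L), 2]
search.dt <- function() x.dt[.(10L), b]

t.df <- time_reps(search.df, 200)
t.dt <- time_reps(search.dt, 200)
print(c(data.frame = t.df, data.table = t.dt))
```

Varying n and the repetition count in opposite directions (few searches on a large table, many searches on a small one) is one way to see the trade-off described in the answer on a particular machine.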
