Why is plyr so slow?

Question

I think I am using plyr incorrectly. Could someone please tell me if this is 'efficient' plyr code?

require(plyr)
plyr <- function(dd) ddply(dd, .(price), summarise, ss=sum(volume)) 

A little context: I have a few large aggregation problems and I have noted that they were each taking some time. In trying to solve the issues, I became interested in the performance of various aggregation procedures in R.

I tested a few aggregation methods - and found myself waiting around all day.

When I finally got results back, I discovered a huge gap between the plyr method and the others - which makes me think that I've done something dead wrong.

I ran the following code (I thought I'd check out the new dataframe package while I was at it):

require(plyr)
require(data.table)
require(dataframe)
require(rbenchmark)
require(xts)

plyr <- function(dd) ddply(dd, .(price), summarise, ss=sum(volume)) 
t.apply <- function(dd) unlist(tapply(dd$volume, dd$price, sum))
t.apply.x <- function(dd) unlist(tapply(dd[,2], dd[,1], sum))
l.apply <- function(dd) unlist(lapply(split(dd$volume, dd$price), sum))
l.apply.x <- function(dd) unlist(lapply(split(dd[,2], dd[,1]), sum))
b.y <- function(dd) unlist(by(dd$volume, dd$price, sum))
b.y.x <- function(dd) unlist(by(dd[,2], dd[,1], sum))
agg <- function(dd) aggregate(dd$volume, list(dd$price), sum)
agg.x <- function(dd) aggregate(dd[,2], list(dd[,1]), sum)
dtd <- function(dd) dd[, sum(volume), by=(price)]

obs <- c(5e1, 5e2, 5e3, 5e4, 5e5, 5e6, 5e6, 5e7, 5e8)
timS <- timeBasedSeq('20110101 083000/20120101 083000')

bmkRL <- list(NULL)

for (i in 1:5){
  tt <- timS[1:obs[i]]

  for (j in 1:8){
    pxl <- seq(0.9, 1.1, by= (1.1 - 0.9)/floor(obs[i]/(11-j)))
    px <- sample(pxl, length(tt), replace=TRUE)
    vol <- rnorm(length(tt), 1000, 100)

    d.df <- base::data.frame(time=tt, price=px, volume=vol)
    d.dfp <- dataframe::data.frame(time=tt, price=px, volume=vol)
    d.matrix <- as.matrix(d.df[,-1])
    d.dt <- data.table(d.df)

    listLabel <- paste('i=',i, 'j=',j)

    bmkRL[[listLabel]] <- benchmark(plyr(d.df), plyr(d.dfp), t.apply(d.df),     
                         t.apply(d.dfp), t.apply.x(d.matrix), 
                         l.apply(d.df), l.apply(d.dfp), l.apply.x(d.matrix),
                         b.y(d.df), b.y(d.dfp), b.y.x(d.matrix), agg(d.df),
                         agg(d.dfp), agg.x(d.matrix), dtd(d.dt),
          columns =c('test', 'elapsed', 'relative'),
          replications = 10,
          order = 'elapsed')
  }
}

The test was supposed to run up to 5e8 observations, but it took too long, mostly because of plyr. The final table, for 5e5 observations, shows the problem:

$`i= 5 j= 8`
                  test  elapsed    relative
15           dtd(d.dt)    4.156    1.000000
6        l.apply(d.df)   15.687    3.774543
7       l.apply(d.dfp)   16.066    3.865736
8  l.apply.x(d.matrix)   16.659    4.008422
4       t.apply(d.dfp)   21.387    5.146054
3        t.apply(d.df)   21.488    5.170356
5  t.apply.x(d.matrix)   22.014    5.296920
13          agg(d.dfp)   32.254    7.760828
14     agg.x(d.matrix)   32.435    7.804379
12           agg(d.df)   32.593    7.842397
10          b.y(d.dfp)   98.006   23.581809
11     b.y.x(d.matrix)   98.134   23.612608
9            b.y(d.df)   98.337   23.661453
1           plyr(d.df) 9384.135 2257.972810
2          plyr(d.dfp) 9384.448 2258.048123

Is this right? Why is plyr 2250x slower than data.table? And why didn't using the new data frame package make a difference?

The session info is:

> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] xts_0.8-6        zoo_1.7-7        rbenchmark_0.3   dataframe_2.5    data.table_1.8.1     plyr_1.7.1      

loaded via a namespace (and not attached):
[1] grid_2.15.1    lattice_0.20-6 tools_2.15.1 

Recommended answer

Why is it so slow? A little research turned up a mailing-list post from August 2011 in which @hadley, the package author, states:

This is a drawback of the way that ddply always works with data frames. It will be a bit faster if you use summarise instead of data.frame (because data.frame is very slow), but I'm still thinking about how to overcome this fundamental limitation of the ddply approach.
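
To make the quoted drawback a little more concrete, here is a minimal base R sketch. It does not reproduce plyr's internals; it only contrasts summing one numeric column per group with building a one-row data.frame per group and binding the pieces. The toy data sizes and the helper names by_vector and by_frames are assumptions made up for illustration.

```r
# Hypothetical toy data: column names follow the question, sizes are made up.
set.seed(1)
n  <- 20000
dd <- data.frame(
  price  = sample(seq(0.9, 1.1, length.out = 2000), n, replace = TRUE),
  volume = rnorm(n, 1000, 100)
)

# Vector route: split a single numeric column and sum each piece.
by_vector <- function(dd) vapply(split(dd$volume, dd$price), sum, numeric(1))

# Data-frame route: split the whole data frame, build a one-row data.frame
# per group, then bind the pieces together. This only imitates the general
# shape of the work; it is not plyr's actual implementation.
by_frames <- function(dd) {
  pieces <- lapply(split(dd, dd$price), function(g) {
    data.frame(price = g$price[1], ss = sum(g$volume))
  })
  do.call(rbind, pieces)
}

v <- by_vector(dd)
f <- by_frames(dd)

# Both routes produce the same per-price sums, in the same group order.
stopifnot(isTRUE(all.equal(unname(v), f$ss)))

# Timings depend on the machine; compare the elapsed values yourself.
system.time(by_vector(dd))
system.time(by_frames(dd))
```

With many distinct prices, the second route creates one small data.frame per group, which is the kind of per-group cost the quote points at.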

As for whether this is efficient plyr code, I didn't know either. After a bunch of parameter testing and benchmarking, it looks like we can do better.

The summarize() in your command is just a helper function, pure and simple. We can replace it with our own sum function, since it isn't helping with anything that isn't already simple, and the .data and .(price) arguments can be made more explicit. The result is:

ddply( dd[, 2:3], ~price, function(x) sum( x$volume ) )

summarize may look nice, but it just isn't quicker than a simple function call. That makes sense; just compare our little function with the code for summarize. Running your benchmarks with the revised formula yields a noticeable gain. Don't take that to mean you've used plyr incorrectly. You haven't; it just isn't efficient, and nothing you can do with it will make it as fast as the other options.

In my opinion the optimized function still stinks: it isn't clear, it has to be mentally parsed, and it is still ridiculously slow compared with data.table (even with a 60% gain).

The same thread mentioned above, about the slowness of plyr, mentions a plyr2 project. Since the original answer to this question, the plyr author has released dplyr as the successor to plyr. While both plyr and dplyr are billed as data manipulation tools and your primary stated interest is aggregation, you may still be interested in benchmark results for the new package for comparison, as it has a reworked backend to improve performance.

plyr_Original   <- function(dd) ddply( dd, .(price), summarise, ss=sum(volume))
plyr_Optimized  <- function(dd) ddply( dd[, 2:3], ~price, function(x) sum( x$volume ) )

dplyr <- function(dd) dd %.% group_by(price) %.% summarize( sum(volume) )    

data_table <- function(dd) dd[, sum(volume), keyby=price]

The dataframe package has been removed from CRAN and subsequently from the tests, along with the matrix function versions.

Here are the i=5, j=8 benchmark results:

$`obs= 500,000 unique prices= 158,286 reps= 5`
                  test elapsed relative
9     data_table(d.dt)   0.074    1.000
4          dplyr(d.dt)   0.133    1.797
3          dplyr(d.df)   1.832   24.757
6        l.apply(d.df)   5.049   68.230
5        t.apply(d.df)   8.078  109.162
8            agg(d.df)  11.822  159.757
7            b.y(d.df)  48.569  656.338
2 plyr_Optimized(d.df) 148.030 2000.405
1  plyr_Original(d.df) 401.890 5430.946

No doubt the optimizing helped a bit. Take a look at the d.df functions; they just can't compete.
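
As a quick sanity check on how to read that table, the sketch below recomputes two things from the elapsed column: the relative values (each elapsed time divided by the fastest one) and the share of run time that plyr_Optimized saves over plyr_Original. The numbers are copied from the table; the variable names are only for illustration.

```r
# Elapsed seconds copied from the i=5, j=8 table above.
elapsed <- c(
  data_table     = 0.074,
  plyr_Optimized = 148.030,
  plyr_Original  = 401.890
)

# The "relative" column is each elapsed time divided by the fastest one.
relative <- round(elapsed / min(elapsed), 3)
relative

# Share of plyr_Original's run time saved by plyr_Optimized, in percent.
gain_pct <- round(
  100 * (1 - elapsed[["plyr_Optimized"]] / elapsed[["plyr_Original"]]),
  1
)
gain_pct
```

The saving works out to roughly 63%, which is where the "60% gain" figure comes from, and plyr_Optimized still sits about 2000 times behind data_table.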

For a little perspective on the slowness of the data.frame structure, here are micro-benchmarks of the aggregation times of data_table and dplyr on a larger test dataset (i=8, j=8).

$`obs= 50,000,000 unique prices= 15,836,476 reps= 5`
Unit: seconds
             expr    min     lq median     uq    max neval
 data_table(d.dt)  1.190  1.193  1.198  1.460  1.574    10
      dplyr(d.dt)  2.346  2.434  2.542  2.942  9.856    10
      dplyr(d.df) 66.238 66.688 67.436 69.226 86.641    10

The data.frame is still left in the dust. Not only that, but here's the elapsed system.time to populate the data structures with the test data:

`d.df` (data.frame)  3.181 seconds.
`d.dt` (data.table)  0.418 seconds.

Both creation and aggregation are slower for the data.frame than for the data.table.

Working with the data.frame in R is slower than some alternatives, but as the benchmarks show, the built-in R functions blow plyr out of the water. Even managing the data.frame as dplyr does, which improves upon the built-ins, doesn't give optimal speed, whereas data.table is faster in both creation and aggregation, and it does what it does while working with data.frames.

In the end...

plyr is slow because of the way it works with and manages data.frame manipulation.

[punt:: see the comments to the original question].

## R version 3.0.2 (2013-09-25)
## Platform: x86_64-pc-linux-gnu (64-bit)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] microbenchmark_1.3-0 rbenchmark_1.0.0     xts_0.9-7           
## [4] zoo_1.7-11           data.table_1.9.2     dplyr_0.1.2         
## [7] plyr_1.8.1           knitr_1.5.22        
## 
## loaded via a namespace (and not attached):
## [1] assertthat_0.1  evaluate_0.5.2  formatR_0.10.4  grid_3.0.2     
## [5] lattice_0.20-27 Rcpp_0.11.0     reshape2_1.2.2  stringr_0.6.2  
## [9] tools_3.0.2

Data-generating gist (.rmd)
