Boosting ggplot2 performance

Problem Description

The ggplot2 package is easily the best plotting system I have ever worked with, except that its performance is not really good for larger datasets (~50k points). I'm looking into providing web analyses through Shiny, using ggplot2 as the plotting backend, but I'm not really happy with the performance, especially in contrast with base graphics. My question is whether there are any concrete ways to improve this performance.

The starting point is the following code example:

library(ggplot2)

n = 86400 # a day in seconds
dat = data.frame(id = 1:n, val = sort(runif(n)))

dev.new()

gg_base = ggplot(dat, aes(x = id, y = val))
gg_point = gg_base + geom_point()
gg_line = gg_base + geom_line()
gg_both = gg_base + geom_point() + geom_line()

benchplot(gg_point)
benchplot(gg_line)
benchplot(gg_both)
system.time(plot(dat))
system.time(plot(dat, type = 'l'))

I get the following timings on my MacBook Pro Retina:

> benchplot(gg_point)
       step user.self sys.self elapsed
1 construct     0.000    0.000   0.000
2     build     0.321    0.078   0.398
3    render     0.271    0.088   0.359
4      draw     2.013    0.018   2.218
5     TOTAL     2.605    0.184   2.975
> benchplot(gg_line)
       step user.self sys.self elapsed
1 construct     0.000    0.000   0.000
2     build     0.330    0.073   0.403
3    render     0.622    0.095   0.717
4      draw     2.078    0.009   2.266
5     TOTAL     3.030    0.177   3.386
> benchplot(gg_both)
       step user.self sys.self elapsed
1 construct     0.000    0.000   0.000
2     build     0.602    0.155   0.757
3    render     0.866    0.186   1.051
4      draw     4.020    0.030   4.238
5     TOTAL     5.488    0.371   6.046
> system.time(plot(dat))
   user  system elapsed 
  1.133   0.004   1.138 
# Note that the timings below depended heavily on whether or not the graphics
# device was in view; with the device not in view, performance was much, much
# better (see the off-screen-device sketch after these timings).
> system.time(plot(dat, type = 'l'))
   user  system elapsed 
  1.230   0.003   1.233 
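
To take the visibility of the graphics device out of the equation, the timings can be repeated on an off-screen device. A minimal sketch, assuming pdf(file = NULL) as a null device (this device choice is an illustration, not something used in the original question):

pdf(file = NULL)                      # null device: nothing is shown or written to disk
system.time(print(gg_point))          # ggplot2 objects must be print()ed to draw
system.time(plot(dat, type = 'l'))    # base graphics equivalent
dev.off()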

Some more info on my setup:

> sessionInfo()
R version 2.15.3 (2013-03-01)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] C/UTF-8/C/C/C/C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_0.9.3.1

loaded via a namespace (and not attached):
 [1] MASS_7.3-23        RColorBrewer_1.0-5 colorspace_1.2-1   dichromat_2.0-0   
 [5] digest_0.6.3       grid_2.15.3        gtable_0.1.2       labeling_0.1      
 [9] munsell_0.4        plyr_1.8           proto_0.3-10       reshape2_1.2.2    
[13] scales_0.2.3       stringr_0.6.2     

Solution

Hadley gave a cool talk about his new packages dplyr and ggvis at useR! 2013, but he can probably tell you more about that himself.
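
For reference, a minimal ggvis sketch of the same scatterplot (this uses the pipe-based API of the released ggvis package and is an illustration, not code from the talk):

library(ggvis)

# dat is the 86,400-row data frame from the question
dat %>%
  ggvis(~id, ~val) %>%
  layer_points()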

I'm not sure what your application design looks like, but I often do in-database pre-processing before feeding the data to R. For example, if you are plotting time series, there is really no need to show every second of the day on the x axis. Instead you might want to aggregate and get the min/max/mean over, say, one- or five-minute intervals (a quick in-R sketch of this follows).
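
The same downsampling can also be done in R itself before the plot is built. A minimal sketch, reusing dat from the question and dplyr for the aggregation (the five-minute bin width and the ribbon/line presentation are illustrative choices, not part of the original answer):

library(dplyr)
library(ggplot2)

# 86,400 per-second values -> 288 five-minute summaries
dat_agg <- dat %>%
  mutate(bin = (id - 1) %/% 300) %>%                 # 300 seconds = 5 minutes
  group_by(bin) %>%
  summarise(ymin = min(val), ymax = max(val), yavg = mean(val))

# drawing a few hundred rows is fast, and the min/max ribbon still shows
# the within-bin spread that the individual points would have conveyed
ggplot(dat_agg, aes(x = bin)) +
  geom_ribbon(aes(ymin = ymin, ymax = ymax), alpha = 0.3) +
  geom_line(aes(y = yavg))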

Below is an example of a function I wrote years ago that does something like that in SQL. This particular example uses the modulo operator because the times were stored as epoch milliseconds. But if the data in SQL are properly stored as date/datetime structures, SQL has more elegant native methods to aggregate by time period.

#' @param table name of the table
#' @param start start time/date
#' @param end end time/date
#' @param aggregate one of "days", "hours", "mins" or "weeks"
#' @param group grouping variable
#' @param column name of the target column (y axis)
#' @export
minmaxdata <- function(table, start, end, aggregate=c("days", "hours", "mins", "weeks"), group=1, column){

  #dates
  start <- round(unclass(as.POSIXct(start))*1000);
  end <- round(unclass(as.POSIXct(end))*1000);

  #must aggregate
  aggregate <- match.arg(aggregate);

  #calculate modulus
  mod <- switch(aggregate,
    "mins"   = 1000*60,
    "hours"  = 1000*60*60,
    "days"   = 1000*60*60*24,
    "weeks"  = 1000*60*60*24*7,
    stop("invalid aggregate value")
  );

  #we need to add the time difference between GMT and PST to make the modulo work
  delta <- 1000 * 60 * 60 * (24 - unclass(as.POSIXct(format(Sys.time(), tz="GMT")) - Sys.time()));  

  #form query
  query <- paste(
    "SELECT", group, "AS grouping,",
    "AVG(", column, ") AS yavg,",
    "MAX(", column, ") AS ymax,",
    "MIN(", column, ") AS ymin,",
    "((CMilliseconds_g +", delta, ") DIV", mod, ") AS timediv",
    "FROM", table,
    "WHERE CMilliseconds_g BETWEEN", start, "AND", end,
    "GROUP BY", group, ", timediv;"
  )
  mydata <- getquery(query);

  #data
  mydata$time <- structure(mod*mydata[["timediv"]]/1000 - delta/1000, class=c("POSIXct", "POSIXt"));
  mydata$grouping <- as.factor(mydata$grouping)

  #round timestamps
  if(aggregate %in% c("mins", "hours")){
    mydata$time <- round(mydata$time, aggregate)
  } else {
    mydata$time <- as.Date(mydata$time);
  }

  #return
  return(mydata)
}
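
A hypothetical call could look like the following; getquery() is the author's own database helper (not shown above), and the table, group, and column names are placeholders:

# assumes a table with an epoch-millisecond column CMilliseconds_g
# and a working getquery() that sends SQL to the database
mydata <- minmaxdata(
  table     = "measurements",          # placeholder table name
  start     = "2013-06-01 00:00:00",
  end       = "2013-06-02 00:00:00",
  aggregate = "mins",
  group     = "sensor_id",             # placeholder grouping column
  column    = "value"                  # placeholder value column
)

ggplot(mydata, aes(x = time)) +
  geom_ribbon(aes(ymin = ymin, ymax = ymax, fill = grouping), alpha = 0.3) +
  geom_line(aes(y = yavg, colour = grouping))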
