R: Using "microbenchmark" and ggplot2 to plot runtimes

Question

I am using the R programming language. I want to learn how to measure and plot the run time of different procedures as the size of the data increases.

I found a previous Stack Overflow post that answers a similar question: Plot the run time of three functions

It seems that the "microbenchmark" library in R should be able to accomplish this task.

Suppose I simulate the following data:

#load libraries

library(microbenchmark)
library(dplyr)
library(ggplot2)
library(Rtsne)
library(cluster)
library(dbscan)
library(plotly)

#simulate data

var_1 <- rnorm(1000,1,4)
var_2<-rnorm(1000,10,5)
var_3 <- sample( LETTERS[1:4], 1000, replace=TRUE, prob=c(0.1, 0.2, 0.65, 0.05) )
var_4 <- sample( LETTERS[1:2], 1000, replace=TRUE, prob=c(0.4, 0.6) )


#put them into a data frame called "f"
f <- data.frame(var_1, var_2, var_3, var_4)

#declare var_3 and var_4 as factors
f$var_3 = as.factor(f$var_3)
f$var_4 = as.factor(f$var_4)

#add id
f$ID <- seq_along(f[,1])

Now, I want to measure the run time of 7 different procedures:

#Procedure 1

gower_dist <- daisy(f[,-5],
                    metric = "gower")

gower_mat <- as.matrix(gower_dist)


#Procedure 2

lof <- lof(gower_dist, k=3)

#Procedure 3

lof <- lof(gower_dist, k=5)

#Procedure 4

tsne_obj <- Rtsne(gower_dist,  is_distance = TRUE)

tsne_data <- tsne_obj$Y %>%
    data.frame() %>%
    setNames(c("X", "Y")) %>%
    mutate(
           name = f$ID)

#Procedure 5

tsne_obj <- Rtsne(gower_dist, perplexity =10,  is_distance = TRUE)

tsne_data <- tsne_obj$Y %>%
    data.frame() %>%
    setNames(c("X", "Y")) %>%
    mutate(
           name = f$ID)

#Procedure 6

plot = ggplot(aes(x = X, y = Y), data = tsne_data) + geom_point(aes())

#Procedure 7

tsne_obj <- Rtsne(gower_dist,  is_distance = TRUE)

tsne_data <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(
    name = f$ID, 
    lof=lof,
    var1=f$var_1,
    var2=f$var_2,
    var3=f$var_3
    )

p1 <- ggplot(aes(x = X, y = Y, size=lof, key=name, var1=var1, 
  var2=var2, var3=var3), data = tsne_data) + 
  geom_point(shape=1, col="red")+
  theme_minimal()

ggplotly(p1, tooltip = c("lof", "name", "var1", "var2", "var3"))

Using the "microbenchmark" library, I can find out the time of individual functions:

procedure_1_part_1 <- microbenchmark(daisy(f[,-5],
                    metric = "gower"))

procedure_1_part_2 <-  microbenchmark(as.matrix(gower_dist))

I want to make a graph of the run times like this:

https://umap-learn.readthedocs.io/en/latest/benchmarking.html

Question: Can someone please show me how to make this graph, and how to use microbenchmark on multiple functions at once for different sizes of the data frame "f" (5, 10, 50, 100, 200, 500, 1000 rows)?

microbenchmark({gower_dist <- daisy(f[1:5,-5], metric = "gower"); gower_mat <- as.matrix(gower_dist)})

microbenchmark({gower_dist <- daisy(f[1:10,-5], metric = "gower"); gower_mat <- as.matrix(gower_dist)})

microbenchmark({gower_dist <- daisy(f[1:50,-5], metric = "gower"); gower_mat <- as.matrix(gower_dist)})

etc

There does not seem to be a straightforward way to do this in R:

mean(procedure_1_part_1)
[1] NA

Warning message:
In mean.default(procedure_1_part_1) :
  argument is not numeric or logical: returning NA

I could manually run each one of these, copy the results into Excel and plot them, but this would take a long time.

 tm <- microbenchmark( daisy(f[,-5],
                        metric = "gower"),
    as.matrix(gower_dist))

 tm
Unit: microseconds
                             expr    min     lq     mean  median      uq    max neval cld
 daisy(f[, -5], metric = "gower") 2071.9 2491.4 3144.921 3563.65 3621.00 4727.8   100   b
            as.matrix(gower_dist)  129.3  147.5  194.709  180.80  232.45  414.2   100  a 

Is there a quicker way to make a graph?

Thanks

Solution

Here is a solution that benchmarks the first three procedures from the original post at several data frame sizes, and then charts their average run times with ggplot().

Setup

We start the process by executing the code necessary to create the data from the original post.

library(dplyr)
library(ggplot2)
library(Rtsne)
library(cluster)
library(dbscan)
library(plotly)
library(microbenchmark)

#simulate data

var_1 <- rnorm(1000,1,4)
var_2<-rnorm(1000,10,5)
var_3 <- sample( LETTERS[1:4], 1000, replace=TRUE, prob=c(0.1, 0.2, 0.65, 0.05) )
var_4 <- sample( LETTERS[1:2], 1000, replace=TRUE, prob=c(0.4, 0.6) )

#put them into a data frame called "f"
f <- data.frame(var_1, var_2, var_3, var_4,ID=1:1000)

#declare var_3 and var_4 as factors
f$var_3 = as.factor(f$var_3)
f$var_4 = as.factor(f$var_4)

Automation of the benchmarking process by data frame size

First, we create a vector of data frame sizes to drive the benchmarking.

# configure run sizes
sizes <- c(5,10,50,100,200,500,1000)

Next, we take the first procedure and alter it so we can vary the number of observations that are used from the data frame f. Note that since we need to use the outputs from this procedure in subsequent steps, we use assign() to write them to the global environment. We also include the number of observations in the object name so we can retrieve them by size in subsequent steps.

# Procedure 1
proc1 <- function(size){
    assign(paste0("gower_dist_",size), daisy(f[1:size,-5],
                        metric = "gower"),envir = .GlobalEnv)
        
    assign(paste0("gower_mat_",size),as.matrix(get(paste0("gower_dist_",size),envir = .GlobalEnv)),
           envir = .GlobalEnv)
        
}     
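
Writing to the global environment works, but it leaves one pair of objects per size in the workspace. As a minimal sketch of the same idea, the outputs could be stored in a dedicated environment instead. The names cache and proc1_cached() are hypothetical and do not appear in the original answer, and the smaller 100-row data frame is only there to keep the example quick.

```r
library(cluster)  # provides daisy()

# Small stand-in for the data frame f from the post (100 rows instead of 1000)
set.seed(1)
f <- data.frame(
  var_1 = rnorm(100, 1, 4),
  var_2 = rnorm(100, 10, 5),
  var_3 = factor(sample(LETTERS[1:4], 100, replace = TRUE)),
  var_4 = factor(sample(LETTERS[1:2], 100, replace = TRUE)),
  ID    = 1:100
)

# Hypothetical container for intermediate results (not in the original answer)
cache <- new.env()

# Same work as proc1(), but the results go into `cache` rather than .GlobalEnv
proc1_cached <- function(size) {
  d <- daisy(f[1:size, -5], metric = "gower")
  assign(paste0("gower_dist_", size), d, envir = cache)
  assign(paste0("gower_mat_", size), as.matrix(d), envir = cache)
  invisible(NULL)
}

proc1_cached(10)
ls(cache)                                # names of the stored objects
dim(get("gower_mat_10", envir = cache))  # dimensions of the stored matrix
```

Later procedures would then retrieve their inputs with get(..., envir = cache) rather than envir = .GlobalEnv.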

To run the benchmark by data frame size we use the sizes vector with lapply() and an anonymous function that executes proc1() repeatedly. We also assign the number of observations to a column called obs so we can use it in the plot.

proc1List <- lapply(sizes,function(x){
        b <- microbenchmark(proc1(x))
        b$obs <- x
        b
})

At this point we have one data frame per benchmark based on size. We combine the benchmarks into a single data frame with do.call() and rbind().

proc1summary <- do.call(rbind,(proc1List))

Next, we use the same process with procedures 2 and 3. Notice how we use get() with paste0() to retrieve the correct gower_dist objects by size.

#Procedure 2

proc2 <- function(size){
        lof <- lof(get(paste0("gower_dist_",size),envir = .GlobalEnv), k=3)
}
proc2List <- lapply(sizes,function(x){
    b <- microbenchmark(proc2(x))
    b$obs <- x
    b
})
proc2summary <- do.call(rbind,(proc2List))

#Procedure 3

proc3 <- function(size){
    lof <- lof(get(paste0("gower_dist_",size),envir = .GlobalEnv), k=5)
}

Since k must be less than the number of observations, we adjust the sizes vector to start at 10 for procedure 3.

# configure run sizes
sizes <- c(10,50,100,200,500,1000)

proc3List <- lapply(sizes,function(x){
    b <- microbenchmark(proc3(x))
    b$obs <- x
    b
})
proc3summary <- do.call(rbind,(proc3List))
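
The three lapply() blocks above differ only in the function being timed, so the pattern could be factored into a helper. This is a minimal sketch under stated assumptions: bench_by_size(), its label argument and toy_sort() are hypothetical names that are not part of the original answer, and toy_sort() merely stands in for proc1(), proc2() or proc3().

```r
library(microbenchmark)

# Hypothetical helper: benchmark fun(n) for each n in sizes and stack the results
bench_by_size <- function(fun, sizes, label, times = 20L) {
  runs <- lapply(sizes, function(n) {
    # as.data.frame() drops the microbenchmark class so rbind() stays plain
    b <- as.data.frame(microbenchmark(fun(n), times = times))
    b$obs <- n              # number of observations used in this run
    b$procedure <- label    # expr would read "fun(n)" for every procedure
    b
  })
  do.call(rbind, runs)
}

# Stand-in workload whose cost grows with n (illustration only)
toy_sort <- function(n) sort(runif(n))

res <- bench_by_size(toy_sort, c(10, 100, 1000), label = "toy_sort")

# Mean raw time per procedure and size
aggregate(time ~ procedure + obs, data = res, FUN = mean)
```

With the functions from this answer, the calls would look like bench_by_size(proc1, sizes, "proc1"), and the grouping for the plot would use the procedure column instead of expr.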

Having generated runtime benchmarks for each of the first three procedures, we bind the summary data, summarize to means with dplyr::summarise(), and plot with ggplot().

do.call(rbind,list(proc1summary,proc2summary,proc3summary)) %>% 
    group_by(expr,obs) %>%
    summarise(.,time_ms = mean(time) * .000001) -> proc_time 

The resulting data frame has all the information we need to produce the chart: the procedure used, the number of observations taken from the data frame, and the average time in milliseconds (microbenchmark records times in nanoseconds, hence the factor of .000001).

> head(proc_time)
# A tibble: 6 x 3
# Groups:   expr [1]
  expr       obs time_ms
  <fct>    <dbl>   <dbl>
1 proc1(x)     5   0.612
2 proc1(x)    10   0.957
3 proc1(x)    50   1.32 
4 proc1(x)   100   2.53 
5 proc1(x)   200   5.78 
6 proc1(x)   500  25.9 
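
The time_ms values come from the raw time column of the benchmark objects, which microbenchmark stores in nanoseconds. The sketch below, which uses a stand-in workload rather than the procedures from the post, compares the manual conversion with the figure that summary() reports when asked for milliseconds. It also shows why mean() on the whole benchmark object returns NA, as seen in the question: the object is a data frame, so the numeric time column has to be selected first.

```r
library(microbenchmark)

# Stand-in workload (an assumption for illustration, not from the original post)
b <- microbenchmark(sort(runif(1000)), times = 50L)

# The benchmark object is a data frame with an `expr` and a `time` column
names(b)

# Manual conversion of the mean: nanoseconds -> milliseconds
mean_ms_manual <- mean(b$time) * 1e-6

# The same figure taken from summary(), requesting milliseconds
mean_ms_summary <- summary(b, unit = "ms")$mean

# The two should agree up to floating-point tolerance
all.equal(mean_ms_manual, mean_ms_summary)
```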

Finally, we use ggplot() to produce an x y chart, grouping the lines by procedure used.

ggplot(proc_time,aes(obs,time_ms,group = expr)) +
    geom_line(aes(group = expr),color = "grey80") + 
    geom_point(aes(color = expr))

...and the output:

Since procedures 2 and 3 vary only slightly, k = 3 vs. k = 5, they are almost indistinguishable in the chart.
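
One option for telling overlapping procedures apart, which is not part of the original answer, is to give each procedure its own panel with facet_wrap(). The sketch below uses synthetic values generated from a formula purely to exercise the plotting code. They are not measurements, and proc_time_demo is a hypothetical stand-in for proc_time.

```r
library(ggplot2)

# Synthetic placeholder timings (NOT measurements): two of the three
# procedures are given identical values so that they would overlap in one panel
proc_time_demo <- expand.grid(
  expr = c("proc1(x)", "proc2(x)", "proc3(x)"),
  obs  = c(10, 50, 100, 200, 500, 1000)
)
proc_time_demo$time_ms <- ifelse(
  proc_time_demo$expr == "proc1(x)",
  proc_time_demo$obs^2 * 1e-4,  # made-up quadratic growth
  proc_time_demo$obs * 1e-2     # made-up linear growth
)

# One panel per procedure; a log x axis spreads out the small sizes
p <- ggplot(proc_time_demo, aes(obs, time_ms)) +
  geom_line(color = "grey80") +
  geom_point(aes(color = expr)) +
  scale_x_log10() +
  facet_wrap(~ expr)
p
```

Replacing proc_time_demo with proc_time should give the faceted version of the chart above, since both have expr, obs and time_ms columns.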

Conclusions

With a combination of wrapper functions and lapply() we can generate the information needed to produce the chart requested in the original post.

The general pattern of modifications is:

  1. Wrap the original procedure in a function that we can use as the unit of analysis for microbenchmark(), and include a size argument
  2. Modify the procedure to use size as a variable where necessary
  3. Modify the procedure to access objects from previous steps, based on the size argument
  4. Modify the procedure to write its outputs with assign() and size if these are needed for subsequent procedure steps

We leave automation of benchmarking procedures 4 - 7 by data frame size and integrating them into the plot as an interesting exercise for the original poster.
