Discrepancies in Benchmark and Processing Time Results


Problem Description

I've been trying to do a little testing of the most efficient ways to replace NA's in dataframes.

I started with a comparison of NA-to-0 replacement solutions on a 1 million row, 12 column dataset. Throwing all the pipe-capable ones into microbenchmark, I got the following results.

Question 1: Is there a way to test the subset left-assignment statements (e.g. df1[is.na(df1)] <- 0) inside the benchmark function?

library(dplyr)
library(tidyr)
library(microbenchmark)

set.seed(24)
df1 <- as.data.frame(matrix(sample(c(NA, 1:5), 1e6 *12, replace=TRUE),
                            dimnames = list(NULL, paste0("var", 1:12)), ncol=12))

op <- microbenchmark(
    mut_all_ifelse   = df1 %>% mutate_all(funs(ifelse(is.na(.), 0, .))),
    mut_at_ifelse    = df1 %>% mutate_at(funs(ifelse(is.na(.), 0, .)), .cols = c(1:12)),
    # df1[is.na(df1)] <- 0 would sit here, but I can't make it work inside this function
    replace          = df1 %>% replace(., is.na(.), 0),
    mut_all_replace  = df1 %>% mutate_all(funs(replace(., is.na(.), 0))),
    mut_at_replace   = df1 %>% mutate_at(funs(replace(., is.na(.), 0)), .cols = c(1:12)),
    replace_na       = df1 %>% replace_na(list(var1 = 0, var2 = 0, var3 = 0, var4 = 0, var5 = 0, var6 = 0, var7 = 0, var8 = 0, var9 = 0, var10 = 0, var11 = 0, var12 = 0)),
    times = 1000L
)

print(op) #standard data frame of the output
    Unit: milliseconds
            expr       min       lq     mean   median       uq       max neval
  mut_all_ifelse 769.87848 844.5565 871.2476 856.0941 895.4545 1274.5610  1000
   mut_at_ifelse 713.48399 847.0322 875.9433 861.3224 899.7102 1006.6767  1000
         replace 258.85697 311.9708 334.2291 317.3889 360.6112  455.7596  1000
 mut_all_replace  96.81479 164.1745 160.6151 167.5426 170.5497  219.5013  1000
  mut_at_replace  96.23975 166.0804 161.9302 169.3984 172.7442  219.0359  1000
      replace_na 103.04600 161.2746 156.7804 165.1649 168.3683  210.9531  1000
boxplot(op) #boxplot of output

library(ggplot2) #nice log plot of the output
qplot(y=time, data=op, colour=expr) + scale_y_log10()

To test the subset assignment operator, I had originally run these tests.

set.seed(24) 
> Book1 <- as.data.frame(matrix(sample(c(NA, 1:5), 1e8 *12, replace=TRUE),
+ dimnames = list(NULL, paste0("var", 1:12)), ncol=12))
> system.time({ 
+     Book1 %>% mutate_all(funs(ifelse(is.na(.), 0, .))) })
   user  system elapsed 
  52.79   24.66   77.45 
> 
> system.time({ 
+     Book1 %>% mutate_at(funs(ifelse(is.na(.), 0, .)), .cols = c(1:12)) })
   user  system elapsed 
  52.74   25.16   77.91 
> 
> system.time({ 
+     Book1[is.na(Book1)] <- 0 })
   user  system elapsed 
  16.65    7.86   24.51 
> 
> system.time({ 
+     Book1 %>% replace_na(list(var1 = 0, var2 = 0, var3 = 0, var4 = 0, var5 = 0, var6 = 0, var7 = 0, var8 = 0, var9 = 0,var10 = 0, var11 = 0, var12 = 0)) })
   user  system elapsed 
   3.54    2.13    5.68 
> 
> system.time({ 
+     Book1 %>% mutate_at(funs(replace(., is.na(.), 0)), .cols = c(1:12)) })
   user  system elapsed 
   3.37    2.26    5.63 
> 
> system.time({ 
+     Book1 %>% mutate_all(funs(replace(., is.na(.), 0))) })
   user  system elapsed 
   3.33    2.26    5.58 
> 
> system.time({ 
+     Book1 %>% replace(., is.na(.), 0) })
   user  system elapsed 
   3.42    1.09    4.51 

In these tests the base replace() comes in first. In the benchmarking trials, replace() falls farther back in the ranks while the tidyr replace_na() wins (by a nose). Running the individual tests repeatedly, and on data frames of different shapes and sizes, always finds the base replace() in the lead.

Question 2: How could its benchmark performance be the only result that falls so far out of line with the simple test results?

More perplexingly - Question 3: How can all the mutate_all/_at(replace()) variants work faster than the simple replace()? Many folks report this: http://datascience.la/dplyr-and-a-very-basic-benchmark/ (and all the links in that article), but I still haven't found an explanation for why, beyond the fact that hashing and C++ are used.

With special thanks already to Tyler Rinker: https://www.r-bloggers.com/microbenchmarking-with-r/ and akrun: https://stackoverflow.com/a/41530071/5088194

Solution

You can include a complex/multi-statement expression in microbenchmark by wrapping it with {}, which, basically, converts it to a single expression:

microbenchmark(expr1 = { df1[is.na(df1)] = 0 }, 
               exp2 = { tmp = 1:10; tmp[3] = 0L; tmp2 = tmp + 12L; tmp2 ^ 2 }, 
               times = 10)
#Unit: microseconds
#  expr        min         lq       mean     median         uq        max neval cld
# expr1 124953.716 137244.114 158576.030 142405.685 156744.076 284779.353    10   b
#  exp2      2.784      3.132     17.748     23.142     24.012     38.976    10  a 

Worth noting the side-effects of this:

tmp
#[1]  1  2  0  4  5  6  7  8  9 10

in contrast to, say, something like:

rm(tmp)
microbenchmark(expr1 = { df1[is.na(df1)] = 0 },  
               exp2 = local({ tmp = 1:10; tmp[3] = 0L; tmp2 = tmp + 12L; tmp2 ^ 2 }), 
               times = 10)
#Unit: microseconds
#  expr       min         lq        mean     median         uq        max neval cld
# expr1 127250.18 132935.149 165296.3030 154509.553 169917.705 314820.306    10   b
#  exp2     10.44     12.181     42.5956     54.636     57.072     97.789    10  a 
tmp
#Error: object 'tmp' not found

Noticing the side effects a benchmark can have, we see that the first operation, which removes the NA values, leaves a fairly light job for the alternatives that follow:

# re-assign because we changed it before
set.seed(24)
df1 = as.data.frame(matrix(sample(c(NA, 1:5), 1e6 * 12, TRUE), 
                           dimnames = list(NULL, paste0("var", 1:12)), ncol = 12))
unique(sapply(df1, typeof))
#[1] "integer"
any(sapply(df1, anyNA))
#[1] TRUE
system.time({ df1[is.na(df1)] <- 0 })
# user  system elapsed 
# 0.39    0.14    0.53 

The previous benchmark leaves us with:

unique(sapply(df1, typeof))
#[1] "double"
any(sapply(df1, anyNA))
#[1] FALSE

And replacing NAs when there are none is (or should be) essentially a no-op on the input.
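
Just to illustrate (a quick sketch, not part of the original answer; the object name df_sketch is made up here): timing the same sub-assignment twice shows how little work remains once the NAs are gone.

# first run: NAs present, real replacement work happens
# second run: no NAs left, mostly just the is.na() scan
set.seed(24)
df_sketch <- as.data.frame(matrix(sample(c(NA, 1:5), 1e6 * 12, replace = TRUE),
                                  dimnames = list(NULL, paste0("var", 1:12)), ncol = 12))
system.time({ df_sketch[is.na(df_sketch)] <- 0 })
system.time({ df_sketch[is.na(df_sketch)] <- 0 })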

Aside from that, note that in all your alternatives you sub-assign a "double" (typeof(0)) into "integer" column-vectors (sapply(df1, typeof)). While I don't think there is any case (in the above alternatives) where df1 gets modified in place (since, after the creation of a "data.frame", info is stored so that its column-vectors are copied on modification), there still is a minor but avoidable overhead in coercing to "double" and storing as "double". Before replacing elements in an "integer" vector, R will allocate and copy (in the case of an "integer" replacement) or allocate and coerce (in the case of a "double" replacement). Also, after the first coercion (from the side effect of a benchmark, as noted above), R operates on "double"s, which involves slower manipulations than on "integer"s. I can't find a straightforward R way to investigate that difference, but in a nutshell (at the risk of not being totally accurate) we can simulate these operations by:

# simulate R's copying of int to int
# allocate a new int and copy
int2int = inline::cfunction(sig = c(x = "integer"), body = '
    SEXP ans = PROTECT(allocVector(INTSXP, LENGTH(x)));
    memcpy(INTEGER(ans), INTEGER(x), LENGTH(x) * sizeof(int));
    UNPROTECT(1);
    return(ans);
')
# R's coercing of int to double
# 'coerceVector', internally, allocates a double and coerces to populate it
int2dbl = inline::cfunction(sig = c(x = "integer"), body = '
    SEXP ans = PROTECT(coerceVector(x, REALSXP));
    UNPROTECT(1);
    return(ans);
')
# simulate R's copying from double to double
dbl2dbl = inline::cfunction(sig = c(x = "double"), body = '
    SEXP ans = PROTECT(allocVector(REALSXP, LENGTH(x)));
    memcpy(REAL(ans), REAL(x), LENGTH(x) * sizeof(double));
    UNPROTECT(1);
    return(ans);
')

And on a benchmark:

x.int = 1:1e7; x.dbl = as.numeric(x.int)
microbenchmark(int2int(x.int), int2dbl(x.int), dbl2dbl(x.dbl), times = 50)
#Unit: milliseconds
#           expr      min       lq     mean   median       uq      max neval cld
# int2int(x.int) 16.42710 16.91048 21.93023 17.42709 19.38547 54.36562    50  a 
# int2dbl(x.int) 35.94064 36.61367 47.15685 37.40329 63.61169 78.70038    50   b
# dbl2dbl(x.dbl) 33.51193 34.18427 45.30098 35.33685 63.45788 75.46987    50   b

To conclude(!) the whole previous note: replacing 0 with 0L should save some time...
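
As a tiny aside (a sketch, not from the original answer), the coercion is easy to see on a bare vector:

x <- 1:5
x[2] <- 0    # sub-assigning a "double" literal coerces the whole vector
typeof(x)
#[1] "double"
y <- 1:5
y[2] <- 0L   # an "integer" literal keeps it "integer"
typeof(y)
#[1] "integer"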

Finally, to replicate the benchmark in a fairer way, we could use:

library(dplyr)
library(tidyr)
library(microbenchmark) 
set.seed(24)
df1 = as.data.frame(matrix(sample(c(NA, 1:5), 1e6 * 12, TRUE), 
                            dimnames = list(NULL, paste0("var", 1:12)), ncol = 12))

Wrap in functions:

stopifnot(ncol(df1) == 12)  #some of the alternatives are hardcoded to 12 columns
mut_all_ifelse = function(x, val) x %>% mutate_all(funs(ifelse(is.na(.), val, .)))
mut_at_ifelse = function(x, val) x %>% mutate_at(funs(ifelse(is.na(.), val, .)), .cols = c(1:12))
baseAssign = function(x, val) { x[is.na(x)] <- val; x }
baseFor = function(x, val) { for(j in 1:ncol(x)) x[[j]][is.na(x[[j]])] = val; x }
base_replace = function(x, val) x %>% replace(., is.na(.), val)
mut_all_replace = function(x, val) x %>% mutate_all(funs(replace(., is.na(.), val)))
mut_at_replace = function(x, val) x %>% mutate_at(funs(replace(., is.na(.), val)), .cols = c(1:12))
myreplace_na = function(x, val) x %>% replace_na(list(var1 = val, var2 = val, var3 = val, var4 = val, var5 = val, var6 = val, var7 = val, var8 = val, var9 = val, var10 = val, var11 = val, var12 = val))

Test for equality of results before the benchmarks:

identical(mut_all_ifelse(df1, 0), mut_at_ifelse(df1, 0))
#[1] TRUE
identical(mut_at_ifelse(df1, 0), baseAssign(df1, 0))
#[1] TRUE
identical(baseAssign(df1, 0), baseFor(df1, 0))
#[1] TRUE
identical(baseFor(df1, 0), base_replace(df1, 0))
#[1] TRUE
identical(base_replace(df1, 0), mut_all_replace(df1, 0))
#[1] TRUE
identical(mut_all_replace(df1, 0), mut_at_replace(df1, 0))
#[1] TRUE
identical(mut_at_replace(df1, 0), myreplace_na(df1, 0))
#[1] TRUE

Test with coercing to "double":

benchnum = microbenchmark(mut_all_ifelse(df1, 0), 
                          mut_at_ifelse(df1, 0), 
                          baseAssign(df1, 0), 
                          baseFor(df1, 0),
                          base_replace(df1, 0), 
                          mut_all_replace(df1, 0),
                          mut_at_replace(df1, 0), 
                          myreplace_na(df1, 0),
                          times = 10)
benchnum
#Unit: milliseconds
#                    expr       min        lq      mean    median        uq       max neval cld
#  mut_all_ifelse(df1, 0) 1368.5091 1441.9939 1497.5236 1509.2233 1550.1416 1629.6959    10   c
#   mut_at_ifelse(df1, 0) 1366.1674 1389.2256 1458.1723 1464.5962 1503.4337 1553.7110    10   c
#      baseAssign(df1, 0)  532.4975  548.9444  586.8198  564.3940  655.8083  667.8634    10  b 
#         baseFor(df1, 0)  169.6048  175.9395  206.7038  189.5428  197.6472  308.6965    10 a  
#    base_replace(df1, 0)  518.7733  547.8381  597.8842  601.1544  643.4970  666.6872    10  b 
# mut_all_replace(df1, 0)  169.1970  183.5514  227.1978  194.0903  291.6625  346.4649    10 a  
#  mut_at_replace(df1, 0)  176.7904  186.4471  227.3599  202.9000  303.4643  309.2279    10 a  
#    myreplace_na(df1, 0)  172.4926  177.8518  199.1469  186.3645  192.1728  297.0419    10 a

Test without coercing to "double":

benchint = microbenchmark(mut_all_ifelse(df1, 0L), 
                          mut_at_ifelse(df1, 0L), 
                          baseAssign(df1, 0L), 
                          baseFor(df1, 0L),
                          base_replace(df1, 0L), 
                          mut_all_replace(df1, 0L),
                          mut_at_replace(df1, 0L),
                          myreplace_na(df1, 0L),
                          times = 10)
benchint
#Unit: milliseconds
#                     expr        min        lq      mean    median        uq       max neval cld
#  mut_all_ifelse(df1, 0L) 1291.17494 1313.1910 1377.9265 1353.2812 1417.4389 1554.6110    10   c
#   mut_at_ifelse(df1, 0L) 1295.34053 1315.0308 1372.0728 1353.0445 1431.3687 1478.8613    10   c
#      baseAssign(df1, 0L)  451.13038  461.9731  477.3161  471.0833  484.9318  528.4976    10  b 
#         baseFor(df1, 0L)   98.15092  102.4996  115.7392  107.9778  136.2227  139.7473    10 a  
#    base_replace(df1, 0L)  428.54747  451.3924  471.5011  470.0568  497.7088  516.1852    10  b 
# mut_all_replace(df1, 0L)  101.66505  102.2316  137.8128  130.5731  161.2096  243.7495    10 a  
#  mut_at_replace(df1, 0L)  103.79796  107.2533  119.1180  112.1164  127.7959  166.9113    10 a  
#    myreplace_na(df1, 0L)  100.03431  101.6999  120.4402  121.5248  137.1710  141.3913    10 a

And a simple way to visualize:

boxplot(benchnum, ylim = range(min(summary(benchint)$min, summary(benchnum)$min),
                               max(summary(benchint)$max, summary(benchnum)$max)))
boxplot(benchint, add = TRUE, border = "red", axes = FALSE) 
legend("topright", c("coerce", "not coerce"), fill = c("black", "red"))                       

Note that df1 is unchanged after all this (str(df1)).
