使用相同任务的 svm 和 ranger 的不同运行时 [英] Different runtime for svm and ranger using the same task

查看:35
本文介绍了使用相同任务的 svm 和 ranger 的不同运行时的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我对两个学习器的运行时间进行了基准测试,并在 {ranger} 和 {svm} 进行训练时截取了 {htop} 的两个屏幕截图,以使我的观点更加清晰.正如这篇文章的标题所述,我的问题是为什么 svm 中的训练/预测与其他学习者(在本例中为 ranger)相比如此慢?是否与学习者的底层结构有关?或者我在代码中犯了错误?或者...?任何帮助表示赞赏.

I've bench-marked the runtime of the two learners and also took two screenshots of the {htop} while {ranger} and {svm} was training to make my point more clearer. As stated in the title of this post, my question is the reason Why train/predict in svm is so slow compare to other learners (in this case ranger)? Is it related to the underlying structure of the learners? Or I am making a mistake in the code? Or...? Any help is appreciated.

游侠训练时的htop;使用所有线程.

htop when ranger is training; all threads are used.

svm 训练时的 htop;只使用了 2 个线程.

htop when svm is training; only 2 threads are used.

代码:

library(mlr3verse)

library(future.apply)
#> Loading required package: future
library(future)
library(lgr)
library(data.table)

file <- 'https://drive.google.com/file/d/1Y3ao-SOFH/view?usp=sharing' 
#you can download the file using this link.
destfile = '/../Downloads/sample.RData'
download.file(file, destfile)


# pre-processing
sample$clc3_red <- as.factor(sample$clc3_red)
sample$X <- NULL
#> Warning in set(x, j = name, value = value): Column 'X' does not exist to remove
sample$confidence <- NULL
sample$ecoregion_id <- NULL
sample$gsw_occurrence_1984_2019 <- NULL

tsk_clf <- mlr3::TaskClassif$new(id = 'sample', backend = sample, target = "clc3_red")
tsk_clf$col_roles$group = 'tile_id' #spatial CV
tsk_clf$col_roles$feature = setdiff(tsk_clf$col_roles$feature ,  'tile_id')
tsk_clf$col_roles$feature = setdiff(tsk_clf$col_roles$feature ,  'x')
tsk_clf$col_roles$feature = setdiff(tsk_clf$col_roles$feature ,  'y')

# 2 learners for benchmarking
svm <- lrn("classif.svm", type = "C-classification", kernel = "radial", predict_type = "response")
ranger <- lrn("classif.ranger", predict_type = "response", importance = "permutation")

# ranger parallel
plan(multicore)
time <- Sys.time()
ranger$
  train(tsk_clf)$
  predict(tsk_clf)$
  score()
#> Warning: Dropped unused factor level(s) in dependent variable: 333, 335, 521,
#> 522.
#> classif.ce 
#>     0.0116
Sys.time() - time
#> Time difference of 20.12981 secs

# svm parallel
plan(multicore)
time <- Sys.time()
svm$
  train(tsk_clf)$
  predict(tsk_clf)$
  score()
#> classif.ce 
#>     0.4361
Sys.time() - time
#> Time difference of 55.13694 secs

reprex 包 (v0.3.0) 于 2021 年 1 月 12 日创建

Created on 2021-01-12 by the reprex package (v0.3.0)

sessionInfo()
#> R version 4.0.2 (2020-06-22)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /opt/microsoft/ropen/4.0.2/lib64/R/lib/libRblas.so
#> LAPACK: /opt/microsoft/ropen/4.0.2/lib64/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] data.table_1.12.8   lgr_0.3.4           future.apply_1.6.0 
#>  [4] future_1.18.0       mlr3verse_0.1.3     paradox_0.3.0      
#>  [7] mlr3viz_0.1.1       mlr3tuning_0.1.2    mlr3pipelines_0.1.3
#> [10] mlr3learners_0.2.0  mlr3filters_0.2.0   mlr3_0.3.0         
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.5         compiler_4.0.2     pillar_1.4.6       highr_0.8         
#>  [5] class_7.3-17       mlr3misc_0.3.0     tools_4.0.2        digest_0.6.25     
#>  [9] uuid_0.1-4         lattice_0.20-41    evaluate_0.14      lifecycle_0.2.0   
#> [13] tibble_3.0.3       checkmate_2.0.0    gtable_0.3.0       pkgconfig_2.0.3   
#> [17] rlang_0.4.7        Matrix_1.2-18      parallel_4.0.2     yaml_2.2.1        
#> [21] xfun_0.15          e1071_1.7-3        ranger_0.12.1      withr_2.2.0       
#> [25] stringr_1.4.0      knitr_1.29         globals_0.12.5     vctrs_0.3.2       
#> [29] grid_4.0.2         glue_1.4.1         listenv_0.8.0      R6_2.4.1          
#> [33] rmarkdown_2.3      ggplot2_3.3.2      magrittr_1.5       mlr3measures_0.2.0
#> [37] codetools_0.2-16   backports_1.1.8    scales_1.1.1       htmltools_0.5.0   
#> [41] ellipsis_0.3.1     colorspace_1.4-1   stringi_1.4.6      munsell_0.5.0     
#> [45] crayon_1.3.4

推荐答案

完全可以预期 SVM 的运行时间比随机森林长,因为 SVM 随着观察数量和特征数量的增加而严重扩展.使 SVM 更昂贵的事情:

It is totally expected that an SVM runs longer than a random forest since SVMs scale badly with increasing number of observations and number of features. Things that make SVM more expensive:

  • Multiclass:在这里,对于每个类,训练一个 SVM,因为 SVM 本身就可以解决二分类
  • 许多分类变量会增加特征的数量,因为它们通过虚拟编码转换为数字特征

有更快的近似 SVM,但通常不值得进一步研究,因为它们的性能通常优于基于树的方法(随机森林)或增强方法 (xgboost).

There are approximate SVMs that are faster, but it's usually not worth to investigate further since they are often outperformed by tree based approaches (random forest) or boosting approaches (xgboost).

另见:为什么我的 SVM 会这样运行多久?火车SVM需要多长时间分类器?

这篇关于使用相同任务的 svm 和 ranger 的不同运行时的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆