Parallelizing random forests
Question
Through searching and asking, I've found many packages that can use all the cores of my server, and many packages that implement random forests.
I'm quite new to this, and I'm getting lost among all the ways to parallelize the training of my random forest. Could you give some advice on reasons to use or avoid each of them, or on specific combinations (with or without caret?) that have proven themselves?
Packages for parallelization: doParallel, doSNOW, doSMP (discontinued?), doMC (and what about mclapply?)
Packages for random forest: [caret + some of the following] rf, parRF, randomForest, ranger, Rborist, parallelRandomForest (crashes my RStudio session...)
Thanks
Recommended answer
There are a few answers on SO that I would take a look at, such as "parallel execution of random forest in R" and "Suggestions for speeding up Random Forests".
Those posts are helpful, but they are a bit older. The ranger package is an especially fast implementation of random forest, so if you are new to this it might be the easiest way to speed up your model training. The ranger authors' paper discusses the tradeoffs between some of the available packages: which one gives you the best performance depends on your data size and number of features.
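As a rough illustration of the ranger route, here is a minimal sketch. It assumes the ranger package is installed, and it uses the built-in iris data, 500 trees, and "all cores but one" purely as placeholder choices, not as recommendations from the original answer. ranger handles multithreading internally through its num.threads argument, so no foreach backend (doParallel, doMC, ...) has to be registered for this.

```r
library(ranger)
library(parallel)

# Assumption for illustration: leave one core free; fall back to 1 thread
# if the core count cannot be detected (detectCores() may return NA).
n_threads <- max(1, detectCores() - 1, na.rm = TRUE)

# Fit a classification forest on the built-in iris data.
# num.threads controls how many threads ranger uses to grow the trees.
fit <- ranger(
  Species ~ .,
  data        = iris,
  num.trees   = 500,
  num.threads = n_threads,
  seed        = 42
)

# Out-of-bag error estimate and in-sample predictions.
oob_error <- fit$prediction.error
preds     <- predict(fit, data = iris)$predictions
```

If you prefer to stay within caret, train() also accepts method = "ranger", though how that interacts with a registered parallel backend is something to check for your own setup.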