Improving model training speed in caret (R)
Question
I have a dataset consisting of 20 features and roughly 300,000 observations. I'm using caret to train a model with doParallel and four cores. Even training on 10% of my data takes well over eight hours for the methods I've tried (rf, nnet, adabag, svmPoly). I'm resampling with bootstrapping 3 times and my tuneLength is 5. Is there anything I can do to speed up this agonizingly slow process? Someone suggested that using the underlying library can speed up the process as much as 10x, but before I go down that route I'd like to make sure there is no other alternative.
Answer
@phiver hits the nail on the head but, for this situation, there are a few things to suggest:
- make sure that you are not exhausting your system memory by using parallel processing. You are making X extra copies of the data in memory when using X workers.
- with a class imbalance, additional sampling can help. Downsampling might help improve performance and take less time.
- use different libraries: ranger instead of randomForest, xgboost or C5.0 instead of gbm. Keep in mind that ensemble methods are fitting a ton of constituent models and are bound to take a while to fit.
- the package has a racing-type algorithm for tuning parameters in less time
- the development version on github has random search methods for the models with a lot of tuning parameters.
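Several of the suggestions above can be combined in a single `train()` call. The sketch below uses a small synthetic, imbalanced dataset (stand-ins for your own predictors and outcome) to show downsampling inside resampling, random search over tuning parameters, and the faster ranger backend; the data and column names are illustrative only.

```r
library(caret)
library(ranger)

set.seed(42)
n <- 300
dat <- data.frame(matrix(rnorm(n * 5), ncol = 5))
# Synthetic imbalanced two-class outcome (~80/20 split)
dat$Class <- factor(ifelse(runif(n) < 0.8, "neg", "pos"))

ctrl <- trainControl(
  method   = "boot",    # bootstrap resampling, as in the question
  number   = 3,         # 3 resamples
  sampling = "down",    # downsample the majority class within each resample
  search   = "random"   # random search instead of a full tuning grid
)

fit <- train(
  Class ~ ., data = dat,
  method     = "ranger",  # fast C++ random forest implementation
  trControl  = ctrl,
  tuneLength = 5          # 5 randomly chosen candidate parameter sets
)
print(fit$bestTune)
```

On the full 300,000-row dataset, the same control object applies; the memory caveat above still holds, so with `doParallel` keep the worker count low enough that X copies of the data fit in RAM.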
Max