How to improve randomForest performance?

Question

I have a training set of size 38 MB (12 attributes, 420000 rows). I am running the R snippet below to train a model using randomForest, and it is taking hours on my machine.

library(randomForest)

rf.model <- randomForest(
              Weekly_Sales ~ .,
              data        = newdata,
              keep.forest = TRUE,
              importance  = TRUE,
              ntree       = 200,
              do.trace    = TRUE,        # print progress as trees are grown
              na.action   = na.roughfix  # median/mode-impute the NAs
            )

I think it is taking a long time to execute because of na.roughfix; there are a lot of NAs in the training set.

Could someone let me know how I can improve the performance?

My system configuration is:

Intel(R) Core i7 CPU @ 2.90 GHz
RAM - 8 GB
HDD - 500 GB
64 bit OS

Answer

(The tl;dr is that you should a) increase nodesize to >> 1 and b) exclude very low-importance feature columns, maybe even excluding (say) 80% of your columns. Your issue is almost surely not na.roughfix, but if you suspect it, run na.roughfix separately as a standalone step before calling randomForest. Get that red herring out of the way first.)
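A minimal sketch of that separation (using newdata and Weekly_Sales from the question; the two system.time calls are only there to show which step actually dominates):

library(randomForest)

# Impute the NAs once, up front, and time that step on its own.
system.time(newdata.fixed <- na.roughfix(newdata))

# Then train on the already-imputed data; no na.action needed now.
system.time(
  rf.model <- randomForest(
    Weekly_Sales ~ .,
    data        = newdata.fixed,
    keep.forest = TRUE,
    ntree       = 200
  )
)

If the first timing is small compared to the second, na.roughfix is exonerated and the tuning advice below is where the time will be won.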

Now, all of the following advice only applies until you blow out your memory limits, so measure your memory usage and make sure you're not exceeding them. (Start with ridiculously small parameters, then scale them up, measure the runtime, and keep checking that it didn't increase disproportionately.)
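A rough way to do that check, using only base R (the numbers are approximate, but good enough to spot a blow-out):

print(object.size(newdata), units = "MB")   # size of the training data itself
gc()                                        # memory R is using right now
# After a fit, inspect the model object too, e.g.:
# print(object.size(rf.model), units = "MB")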

The main parameters affecting the performance of randomForest are:

  • mtry (less is faster)
  • ntrees (fewer is faster)
  • number of features/cols in data - more is quadratically slower, or worse! See below
  • number of observations/rows in data
  • ncores (more is faster - as long as you actually parallelize; randomForest itself is single-threaded, see the parallel sketch further down)
  • some performance boost from setting importance=F and proximity=F (don't compute the proximity matrix)
  • Never ever use the insane default nodesize=1, for classification! In Breiman's package, you can't directly set maxdepth, but use nodesize as a proxy for that, and also read all the good advice at: CrossValidated: "Practical questions on tuning Random Forests"
  • So here your data has 4.2e+5 rows; if each node shouldn't be smaller than ~0.1%, try nodesize=42. (First try nodesize=420 (1%), see how fast it is, then rerun, adjusting nodesize down. Empirically determine a good nodesize for this dataset; see the sketch after this list.)
  • runtime is proportional to ~ 2^D_max, i.e. polynomial in (-log1p(nodesize))
  • optionally you can also speed up by using sampling; see the strata and sampsize arguments
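Putting those knobs together, a starting-point call might look like the sketch below (newdata.fixed is the pre-imputed data from the earlier na.roughfix step; the values just follow the 1% rule of thumb above and are meant to be tuned, not taken as final):

library(randomForest)

rf.fast <- randomForest(
  Weekly_Sales ~ .,
  data       = newdata.fixed,  # NAs already imputed via na.roughfix
  ntree      = 200,
  nodesize   = 420,            # ~1% of 420000 rows; then tune downward
  mtry       = 3,              # fewer candidate features per split is faster
  importance = FALSE,          # skip once you've ranked the features
  proximity  = FALSE,          # never compute the proximity matrix here
  sampsize   = 42000           # optional: each tree sees a 10% sample
)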

Then a first-order estimate of runtime, denoting mtry=M, ntrees=T, ncores=C, nfeatures=F, nrows=R, maxdepth=D_max, is:

Runtime proportional to: T * F^2 * (R^1.something) * 2^D_max / C
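As a toy illustration, here is a hypothetical helper encoding that proportionality (invented for this answer; only the ratio between two configurations is meaningful, never the absolute number):

# First-order relative-runtime estimate from the proportionality above.
# nrows^1.2 stands in for the "R^1.something" term, and D_max is
# approximated as log2(nrows / nodesize) for a roughly balanced tree.
est.runtime <- function(ntrees, nfeat, nrows, ncores, nodesize) {
  d.max <- log2(nrows / nodesize)
  ntrees * nfeat^2 * nrows^1.2 * 2^d.max / ncores
}

# e.g. the predicted speedup from raising nodesize from 1 to 420 here:
est.runtime(200, 11, 420000, 4, 1) / est.runtime(200, 11, 420000, 4, 420)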

(Again, all bets are off if you exceed memory. Also, try running on only one core, then 2, then 4, and verify that you actually get a linear speedup and not a slowdown.) (The effect of large R is worse than linear, maybe quadratic, since tree-partitioning has to consider all partitions of the data rows; it's certainly somewhat worse than linear. Check that by using sampling or indexing to give it only, say, 10% of the rows.)
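Since the randomForest package itself is single-threaded, the usual way to use those cores is to grow sub-forests in parallel and merge them. A sketch, assuming the foreach and doParallel packages are available:

library(randomForest)
library(foreach)
library(doParallel)

cl <- makeCluster(4)              # rerun with 1, 2, 4 workers and compare timings
registerDoParallel(cl)

# Grow four 50-tree sub-forests in parallel (200 trees total), then merge.
rf.par <- foreach(nt = rep(50, 4),
                  .combine  = randomForest::combine,
                  .packages = "randomForest") %dopar%
  randomForest(Weekly_Sales ~ ., data = newdata.fixed,
               ntree = nt, nodesize = 420)

stopCluster(cl)

# And to probe how runtime scales with R, time a fit on a random 10% of rows:
idx <- sample(nrow(newdata.fixed), 0.1 * nrow(newdata.fixed))
system.time(randomForest(Weekly_Sales ~ ., data = newdata.fixed[idx, ],
                         ntree = 200, nodesize = 42))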

Tip: keeping lots of crap low-importance features quadratically increases runtime, for a sublinear increase in accuracy. This is because at each node we must consider all possible feature selections (or whatever number mtry allows), and within each tree we must consider all (F-choose-mtry) possible combinations of features. So here's my methodology for "fast-and-dirty feature-selection for performance" (sketched in code after the list):

  1. generate a forest normally (slow), but use a sane nodesize=42 or larger
  2. look at rf$importance or randomForest::varImpPlot(). Pick only the top-K features, where you choose K; for a silly-fast example, choose K=3. Save the entire ranking for future reference.
  3. now rerun the forest, but only give it newdata[,importantCols]
  4. confirm that the speedup is roughly quadratic and that oob.error is not much worse
  5. once you know your variable importances, you can turn importance off (importance=F)
  6. tweak mtry and nodesize (one at a time), rerun, and measure the speed improvement
  7. plot your performance results on logarithmic axes
  8. post us the results! Did you corroborate the above? Any comments on memory usage?
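In code, the whole loop looks roughly like this (a sketch; K=3 follows the silly-fast example in step 2, and newdata.fixed is the pre-imputed data from earlier):

library(randomForest)

# Steps 1-2: one slow run with importance on and a sane nodesize.
rf.full <- randomForest(Weekly_Sales ~ ., data = newdata.fixed,
                        ntree = 200, nodesize = 420, importance = TRUE)
varImpPlot(rf.full)
imp <- importance(rf.full, type = 1)   # %IncMSE for regression
K <- 3
importantCols <- rownames(imp)[order(imp[, 1], decreasing = TRUE)][1:K]
print(importantCols)                   # keep the full ranking for reference

# Step 3: rerun on just the top-K columns; importance can stay off now.
rf.small <- randomForest(
  Weekly_Sales ~ .,
  data = newdata.fixed[, c("Weekly_Sales", importantCols)],
  ntree = 200, nodesize = 420, importance = FALSE
)

# Step 4: compare the final out-of-bag MSE, full vs reduced.
tail(rf.full$mse, 1)
tail(rf.small$mse, 1)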

(Note that the above is not a statistically valid procedure for actual feature selection; do not rely on it for that. Read up on the randomForest package for proper methods of RF-based feature selection.)
