Setting values for ntree and mtry for a random forest regression model


Question

I'm using the R package randomForest to do a regression on some biological data. My training data size is 38772 x 201.

I'm wondering what good values would be for the number of trees, ntree, and the number of variables tried at each split, mtry. Is there an approximate formula for finding such parameter values?

Each row in my input data is a 200-character string representing an amino acid sequence, and I want to build a regression model that uses these sequences to predict the distances between the proteins.

Recommended answer

The default for mtry (one third of the predictors, rounded down, for regression) is quite sensible, so there is not really a need to muck with it. There is a function, tuneRF, for optimizing this parameter. However, be aware that it may cause bias.
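As a rough sketch of the point above: for regression, randomForest sets mtry to max(floor(p/3), 1) by default, which works out to 66 for 200 predictors. The snippet below computes that default and runs tuneRF around it. The data are simulated stand-ins (x, y and their sizes are made up for illustration), and the ntreeTry, stepFactor and improve values are arbitrary choices, not recommendations from the answer.

```r
library(randomForest)

set.seed(42)
# Simulated stand-in data (hypothetical): 300 rows, 20 numeric predictors
p <- 20
x <- as.data.frame(matrix(rnorm(300 * p), ncol = p))
y <- x[[1]] + 0.5 * x[[2]]^2 + rnorm(300, sd = 0.3)

# Default mtry for regression in randomForest: max(floor(p / 3), 1)
default_mtry <- max(floor(ncol(x) / 3), 1)

# Search around the default; each candidate mtry is scored by OOB error
tuned <- tuneRF(x, y, mtryStart = default_mtry, ntreeTry = 101,
                stepFactor = 2, improve = 0.01,
                trace = FALSE, plot = FALSE)
print(tuned)  # matrix with columns "mtry" and "OOBError"
```

Because the winning mtry is chosen on the same OOB error that is later reported, that error estimate can look slightly better than it should, which is one way to read the bias warning.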

There is no optimization for the number of bootstrap replicates. I often start with ntree=501 and then plot the random forest object, which shows how the error converges based on the OOB (out-of-bag) error. You want enough trees to stabilize the error but not so many that you over-correlate the ensemble, which leads to overfitting.
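A minimal sketch of that workflow, again on simulated stand-in data rather than the asker's protein data: fit with ntree = 501, plot the object, and look at how much the OOB error still moves over the last trees. The rel_change check is an ad hoc illustration, not something randomForest provides.

```r
library(randomForest)

set.seed(1)
# Simulated stand-in data (hypothetical), not the asker's protein data
x <- as.data.frame(matrix(rnorm(500 * 10), ncol = 10))
y <- 2 * x[[1]] - x[[2]] * x[[3]] + rnorm(500, sd = 0.5)

rf <- randomForest(x, y, ntree = 501)

# For regression, rf$mse[k] is the OOB mean squared error using the first k trees
plot(rf)  # OOB error against number of trees

# Crude numeric check (an assumption, not part of randomForest):
# relative change in OOB MSE across the last 100 trees
last <- tail(rf$mse, 100)
rel_change <- abs(last[100] - last[1]) / last[1]
rel_change
```

If the curve is still visibly dropping at the right edge of the plot, or rel_change is not small, refitting with a larger ntree is the natural next step.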

Here is the caveat: variable interactions stabilize at a slower rate than the error, so if you have a large number of independent variables you need more replicates. I would keep ntree an odd number so ties can be broken (ties only arise in classification voting; for regression the tree predictions are averaged, so an odd count is harmless but not required).

For the dimensions of your problem, I would start with ntree=1501. I would also recommend looking into one of the published variable selection approaches to reduce the number of independent variables.
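The answer does not name a specific selection method, so the following is only a naive stand-in, not one of the published approaches: rank predictors by permutation importance and refit on the top few. The data are simulated, and the cutoff of 5 variables is arbitrary.

```r
library(randomForest)

set.seed(7)
# Simulated stand-in data (hypothetical): only V1 and V2 carry signal
x <- as.data.frame(matrix(rnorm(400 * 15), ncol = 15))
y <- 3 * x[[1]] + 2 * x[[2]] + rnorm(400, sd = 0.5)

rf <- randomForest(x, y, ntree = 501, importance = TRUE)

# type = 1: permutation importance (%IncMSE for regression)
imp <- importance(rf, type = 1)
ranked <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]

# Naive illustration only: keep the top 5 (arbitrary cutoff) and refit
keep <- ranked[1:5]
rf_small <- randomForest(x[, keep], y, ntree = 501)
```

Comparing the final values of rf$mse and rf_small$mse gives a quick sense of whether dropping variables cost any accuracy; a published method would replace the arbitrary cutoff with a principled stopping rule.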

