Setting values for ntree and mtry for a random forest regression model


Question

I'm using the R package randomForest to run a regression on some biological data. My training data has dimensions 38772 x 201.

I was wondering what would be good values for the number of trees, ntree, and the number of variables tried at each split, mtry. Is there an approximate formula for finding such parameter values?

Each row of my input data is a 200-character string representing an amino acid sequence, and I want to build a regression model that uses these sequences to predict the distances between proteins.

Recommended answer

The default for mtry is quite sensible, so there is not really a need to muck with it. For regression, randomForest uses max(floor(p/3), 1), where p is the number of predictors, which would be 66 with 200 predictors. There is a function, tuneRF, for optimizing this parameter. However, be aware that it may introduce bias.
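If you do want to try tuneRF, a minimal sketch might look like the following. The data here is a small synthetic stand-in (an assumption, not your protein data), and the values for ntreeTry, stepFactor and improve are illustrative only.

```r
library(randomForest)

set.seed(42)

# Synthetic stand-in data (assumption): 300 rows, 20 numeric predictors.
n <- 300
p <- 20
x <- as.data.frame(matrix(rnorm(n * p), nrow = n))
y <- 2 * x[[1]] - x[[2]] + rnorm(n)

# Default mtry that randomForest uses for regression.
default_mtry <- max(floor(p / 3), 1)

# Search to the left and right of the default, comparing OOB error.
# Returns a matrix with columns "mtry" and "OOBError".
tuned <- tuneRF(x, y,
                mtryStart  = default_mtry,
                ntreeTry   = 101,
                stepFactor = 1.5,
                improve    = 0.01,
                trace      = FALSE,
                plot       = FALSE)

# mtry value with the lowest OOB error among those tried.
best_mtry <- tuned[which.min(tuned[, "OOBError"]), "mtry"]
print(tuned)
print(best_mtry)
```

Because the search picks mtry by the same OOB error you would later report, treat the winning OOB error as optimistic.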

The number of bootstrap replicates (trees) is not optimized automatically. I often start with ntree=501 and then plot the random forest object, which shows how the error converges based on the OOB error. You want enough trees to stabilize the error, but not so many that you over-correlate the ensemble, which leads to overfitting.
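As a rough sketch of that workflow, again on synthetic stand-in data (an assumption), you can fit with ntree=501, plot the object, and look at how much the OOB error still moves over the last trees. The "last 100 trees" window is an arbitrary choice for illustration, not a rule.

```r
library(randomForest)

set.seed(42)

# Synthetic stand-in data (assumption): 300 rows, 20 numeric predictors.
n <- 300
p <- 20
x <- as.data.frame(matrix(rnorm(n * p), nrow = n))
y <- 2 * x[[1]] - x[[2]] + rnorm(n)

rf <- randomForest(x, y, ntree = 501)

# For a regression forest, plot() draws the OOB MSE against the number of trees.
plot(rf)

# rf$mse holds the OOB MSE after 1, 2, ..., ntree trees.
# Crude stability check: relative spread of the OOB MSE over the last 100 trees.
tail_mse  <- rf$mse[(rf$ntree - 99):rf$ntree]
rel_range <- diff(range(tail_mse)) / mean(tail_mse)
cat("Relative spread of OOB MSE over the last 100 trees:", rel_range, "\n")
```

If the curve is still drifting at the right-hand edge of the plot, refit with a larger ntree and look again.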

Here is the caveat: variable interactions stabilize at a slower rate than the error, so if you have a large number of independent variables you need more replicates. I would keep ntree an odd number so that ties can be broken.

For the dimensions of your problem I would start with ntree=1501. I would also recommend looking into one of the published variable selection approaches to reduce the number of independent variables.
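To give a feel for what reducing the variables involves, here is a deliberately simple sketch that ranks predictors by permutation importance and refits on the top few. It is not one of the published selection methods, only a stand-in for illustration. The synthetic data and the cutoff of 5 variables are assumptions.

```r
library(randomForest)

set.seed(42)

# Synthetic stand-in data (assumption): only V1 and V2 carry signal.
n <- 300
p <- 20
x <- as.data.frame(matrix(rnorm(n * p), nrow = n))
y <- 2 * x[[1]] - x[[2]] + rnorm(n)

# importance = TRUE is needed to get the permutation-based measure.
rf_full <- randomForest(x, y, ntree = 501, importance = TRUE)

# type = 1 is the permutation importance (%IncMSE for regression).
imp <- importance(rf_full, type = 1)[, 1]

# Keep the 5 highest-ranked predictors (arbitrary cutoff for illustration).
keep <- names(sort(imp, decreasing = TRUE))[1:5]

rf_small <- randomForest(x[, keep], y, ntree = 501)

cat("Kept variables:", paste(keep, collapse = ", "), "\n")
cat("OOB MSE, all variables:", rf_full$mse[rf_full$ntree], "\n")
cat("OOB MSE, reduced set:  ", rf_small$mse[rf_small$ntree], "\n")
```

Note that selecting variables and estimating error on the same data makes the reduced model's OOB error look better than it really is, so validate on held-out data.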
