How do I solve overfitting in random forest of Python sklearn?


Problem description

I am using the RandomForestClassifier implemented in the Python sklearn package to build a binary classification model. Below are the results of cross-validation:

Fold 1 : Train: 164  Test: 40
Train Accuracy: 0.914634146341
Test Accuracy: 0.55

Fold 2 : Train: 163  Test: 41
Train Accuracy: 0.871165644172
Test Accuracy: 0.707317073171

Fold 3 : Train: 163  Test: 41
Train Accuracy: 0.889570552147
Test Accuracy: 0.585365853659

Fold 4 : Train: 163  Test: 41
Train Accuracy: 0.871165644172
Test Accuracy: 0.756097560976

Fold 5 : Train: 163  Test: 41
Train Accuracy: 0.883435582822
Test Accuracy: 0.512195121951

I am using the "Price" feature to predict "quality", which is an ordinal value. In each cross-validation fold, there are 163 training examples and 41 test examples.

Apparently, overfitting occurs here. So are there any parameters provided by sklearn that can be used to overcome this problem? I found some parameters here, e.g. min_samples_split and min_samples_leaf, but I do not quite understand how to tune them.

Thanks in advance!

Answer

I would agree with @Falcon w.r.t. the dataset size. It's likely that the main problem is the small size of the dataset. If possible, the best thing you can do is get more data; the more data (generally), the less likely the model is to overfit, as random patterns that appear predictive start to get drowned out as the dataset size increases.

That said, I would look at the following parameters (a combined sketch follows the list):

  1. n_estimators: @Falcon is wrong; in general, the more trees, the less likely the algorithm is to overfit. So try increasing this. The lower this number, the closer the model is to a single decision tree with a restricted feature set.
  2. max_features: try reducing this number (try 30-50% of the number of features). This determines how many features each tree is randomly assigned. The smaller it is, the less likely the model is to overfit, but too small a value will start to introduce underfitting.
  3. max_depth: Experiment with this. This will reduce the complexity of the learned models, lowering the risk of overfitting. Try starting small, say 5-10, and increasing until you get the best result.
  4. min_samples_leaf: Try setting this to values greater than one. This has a similar effect to the max_depth parameter: it means a branch will stop splitting once its leaves each have that number of samples.
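
Putting those four suggestions together, a minimal sketch of a more constrained forest might look like the following (the specific values are illustrative starting points, not tuned results):

from sklearn.ensemble import RandomForestClassifier

# Illustrative starting values only: more trees, fewer features per split,
# shallower trees and larger leaves all reduce the risk of overfitting.
clf = RandomForestClassifier(
    n_estimators=500,      # point 1: more trees
    max_features=0.5,      # point 2: use ~50% of the features at each split
    max_depth=8,           # point 3: limit tree depth
    min_samples_leaf=5,    # point 4: require at least 5 samples per leaf
    random_state=0,
)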

Note: be scientific when doing this work. Use three datasets: a training set, a separate 'development' set to tune your parameters, and a test set on which you evaluate the final model with the optimal parameters. Only change one parameter at a time and evaluate the result, or experiment with the sklearn grid search to search across these parameters all at once.
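
As a sketch of the grid-search route (GridSearchCV is sklearn's grid search; the parameter ranges and the placeholder data/split below are illustrative assumptions, not values from the question):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Placeholder data and split; substitute your own training set.
X, y = make_classification(n_samples=204, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_features": [0.3, 0.5, 0.7],
    "max_depth": [5, 8, 10],
    "min_samples_leaf": [1, 3, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_)
print(search.best_score_)
# search.best_estimator_ can then be evaluated once on the held-out test set.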
