How do I solve overfitting in random forest of Python sklearn?


Problem Description

I am using RandomForestClassifier, implemented in the python sklearn package, to build a binary classification model. Below are the results of cross-validation:

Fold 1 : Train: 164  Test: 40
Train Accuracy: 0.914634146341
Test Accuracy: 0.55

Fold 2 : Train: 163  Test: 41
Train Accuracy: 0.871165644172
Test Accuracy: 0.707317073171

Fold 3 : Train: 163  Test: 41
Train Accuracy: 0.889570552147
Test Accuracy: 0.585365853659

Fold 4 : Train: 163  Test: 41
Train Accuracy: 0.871165644172
Test Accuracy: 0.756097560976

Fold 5 : Train: 163  Test: 41
Train Accuracy: 0.883435582822
Test Accuracy: 0.512195121951

I am using the "Price" feature to predict "quality", which is an ordinal value. In each cross-validation fold there are about 163 training examples and 41 test examples.
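For reference, per-fold train/test accuracies like those above can be produced with a loop along these lines (a minimal sketch: the synthetic X and y and the exact splitter settings are placeholder assumptions, not the actual data):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import StratifiedKFold

    # Placeholder data: 204 samples, so 5 folds split roughly 163/41 as above.
    rng = np.random.RandomState(0)
    X = rng.rand(204, 1)                  # e.g. a single "Price" feature
    y = rng.randint(0, 2, 204)            # binary labels

    clf = RandomForestClassifier(random_state=0)
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

    for i, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
        clf.fit(X[train_idx], y[train_idx])
        train_acc = accuracy_score(y[train_idx], clf.predict(X[train_idx]))
        test_acc = accuracy_score(y[test_idx], clf.predict(X[test_idx]))
        print(f"Fold {i} : Train: {len(train_idx)}  Test: {len(test_idx)}")
        print(f"Train Accuracy: {train_acc}")
        print(f"Test Accuracy: {test_acc}")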

Apparently, overfitting occurs here. So are there any parameters provided by sklearn that can be used to overcome this problem? I found some parameters, e.g. min_samples_split and min_samples_leaf, but I do not quite understand how to tune them.

Thanks in advance!

Recommended Answer

I would agree with @Falcon w.r.t. the dataset size. The main problem is most likely the small size of the dataset. If possible, the best thing you can do is get more data: the more data you have, the less likely (in general) the model is to overfit, since random patterns that appear predictive start to get drowned out as the dataset grows.

That said, I would look at the following parameters (a combined sketch follows the list):

  1. n_estimators: @Falcon is wrong here; in general, the more trees, the less likely the algorithm is to overfit, so try increasing this. The lower this number, the closer the model is to a single decision tree with a restricted feature set.
  2. max_features: try reducing this number (try 30-50% of the number of features). This determines how many features each tree is randomly assigned. The smaller it is, the less likely the model is to overfit, but too small a value will start to introduce underfitting.
  3. max_depth: experiment with this. It reduces the complexity of the learned trees, lowering the risk of overfitting. Start small, say 5-10, and increase until you get the best result.
  4. min_samples_leaf: try setting this to a value greater than one. It has a similar effect to max_depth: a branch stops splitting once its leaves each hold that number of samples.
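Taken together, these are all plain constructor arguments of RandomForestClassifier. A minimal sketch with illustrative starting values (placeholders to be tuned, not recommendations) on placeholder data:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Placeholder data; substitute your real features and labels.
    rng = np.random.RandomState(0)
    X = rng.rand(204, 10)
    y = rng.randint(0, 2, 204)

    # Illustrative starting values only; each should be tuned.
    clf = RandomForestClassifier(
        n_estimators=500,     # more trees: lower variance, less overfitting
        max_features=0.5,     # each split considers ~50% of the features
        max_depth=7,          # cap tree complexity; start small and grow
        min_samples_leaf=3,   # each leaf must hold at least 3 samples
        random_state=0,
    ).fit(X, y)
    print(f"Training accuracy: {clf.score(X, y):.3f}")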

Note: be scientific about this work. Use three datasets: a training set, a separate 'development' set for tweaking your parameters, and a test set for evaluating the final model with the optimal parameters. Change only one parameter at a time and evaluate the result. Alternatively, use sklearn's GridSearchCV to search across all of these parameters at once, as sketched below.
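A minimal GridSearchCV sketch under the same placeholder assumptions as above (the grid values are illustrative, not recommendations):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    rng = np.random.RandomState(0)
    X = rng.rand(204, 10)         # placeholder features
    y = rng.randint(0, 2, 204)    # placeholder binary labels

    # Deliberately small grid for illustration; widen it for a real search.
    param_grid = {
        "n_estimators": [100, 500],
        "max_features": [0.3, 0.5],
        "max_depth": [5, 7, 10],
        "min_samples_leaf": [1, 3, 5],
    }
    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid,
        cv=5,                     # 5-fold cross-validation, as in the question
        scoring="accuracy",
    )
    search.fit(X, y)
    print(search.best_params_)
    print(f"Best CV accuracy: {search.best_score_:.3f}")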
