sklearn random forest: .oob_score_ too low?


Problem description


I was searching for applications of random forests, and I found the following knowledge competition on Kaggle:

https://www.kaggle.com/c/forest-cover-type-prediction.

Following the advice at

https://www.kaggle.com/c/forest-cover-type-prediction/forums/t/8182/first-try-with-random-forests-scikit-learn,

I used sklearn to build a random forest with 500 trees.

The .oob_score_ was ~2%, but the score on the holdout set was ~75%.

There are only seven classes to classify, so 2% is really low. I also consistently got scores near 75% when I cross-validated.

Can anyone explain the discrepancy between the .oob_score_ and the holdout/cross-validated scores? I would expect them to be similar.

There's a similar question here:

https://stats.stackexchange.com/questions/95818/what-is-a-good-oob-score-for-random-forests

Edit: I think it might be a bug, too.

The code is given by the original poster in the second link I posted. The only change is that you have to set oob_score = True when you build the random forest.

I didn't save the cross validation testing I did, but I could redo it if people need to see it.
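
For reference, here is a minimal sketch of the comparison described above, written against a current scikit-learn — X and y stand in for the prepared Kaggle training data, whose exact preprocessing is not reproduced here:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

aRF_CLASSIFIER = RandomForestClassifier( n_estimators = 500, oob_score = True, n_jobs = -1 )
aRF_CLASSIFIER.fit( X, y )                                       # X, y: hypothetical prepared DataSET

print( aRF_CLASSIFIER.oob_score_ )                               # reported above as ~0.02
print( cross_val_score( aRF_CLASSIFIER, X, y, cv = 5 ).mean() )  # reported above as ~0.75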

Solution

Q: Can anyone explain the discrepancy ...

A: The behaviour of the sklearn.ensemble.RandomForestClassifier object and its observed .oob_score_ attribute value is not a bug-related issue.

First, RandomForest-based predictors { Classifier | Regressor } belong to a rather specific corner of the so-called ensemble methods, so be aware that typical approaches, including cross-validation, do not work the same way here as they do for other AI/ML learners.

The RandomForest "inner" logic relies heavily on a RANDOM-PROCESS: the samples ( DataSET X ) with known y == { labels ( for a Classifier ) | targets ( for a Regressor ) } are split throughout the forest generation, each tree being bootstrapped from a RANDOMLY drawn part of the DataSET that this tree can see, while the remaining part stays unseen by it ( thus forming an inner OOB-subSET for that tree ).
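
As a small illustration of that bootstrapping mechanism ( plain numpy here, not the sklearn internals ): each tree draws n rows with replacement, and the rows it never draws — about 1/e, i.e. ~36.8% of them — form its OOB subset:

import numpy as np

rng       = np.random.default_rng( 0 )
n_samples = 10000

boot_rows = rng.integers( 0, n_samples, size = n_samples )   # one bootstrap draw, with replacement

oob_mask            = np.ones( n_samples, dtype = bool )     # rows never drawn form this tree's OOB subset
oob_mask[boot_rows] = False

print( oob_mask.mean() )                                     # ~0.368, i.e. ~1/e of the rows are OOB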

Besides other effects on sensitivity to overfitting et al., the RandomForest ensemble does not need to be cross-validated, because it does not over-fit by design. Many papers, as well as Breiman's (Berkeley) empirical proofs, have supported this statement, as they brought evidence that a cross-validated predictor will exhibit the same .oob_score_

import sklearn.ensemble
aRF_PREDICTOR = sklearn.ensemble.RandomForestRegressor( n_estimators                = 10,           # The number of trees in the forest.
                                                        criterion                   = 'mse',        # { Regressor: 'mse' | Classifier: 'gini' }
                                                        max_depth                   = None,
                                                        min_samples_split           = 2,
                                                        min_samples_leaf            = 1,
                                                        min_weight_fraction_leaf    = 0.0,
                                                        max_features                = 'auto',
                                                        max_leaf_nodes              = None,
                                                        bootstrap                   = True,
                                                        oob_score                   = False,        # SET True to get inner-CrossValidation-alike .oob_score_ attribute calculated right during Training-phase on the whole DataSET
                                                        n_jobs                      = 1,            # { 1 | n-cores | -1 == all-cores }
                                                        random_state                = None,
                                                        verbose                     = 0,
                                                        warm_start                  = False
                                                        )
aRF_PREDICTOR.estimators_                             # aList of <DecisionTreeRegressor>  The collection of fitted sub-estimators.
aRF_PREDICTOR.feature_importances_                    # array of shape = [n_features]     The feature importances (the higher, the more important the feature).
aRF_PREDICTOR.oob_score_                              # float                             Score of the training dataset obtained using an out-of-bag estimate.
aRF_PREDICTOR.oob_prediction_                         # array of shape = [n_samples]      Prediction computed with out-of-bag estimate on the training set.

aRF_PREDICTOR.apply(         X )                      # Apply trees in the forest to X, return leaf indices.
aRF_PREDICTOR.fit(           X, y[, sample_weight] )  # Build a forest of trees from the training set (X, y).
aRF_PREDICTOR.fit_transform( X[, y] )                 # Fit to data, then transform it.
aRF_PREDICTOR.get_params(          [deep] )           # Get parameters for this estimator.
aRF_PREDICTOR.predict(       X )                      # Predict regression target for X.
aRF_PREDICTOR.score(         X, y[, sample_weight] )  # Returns the coefficient of determination R^2 of the prediction.
aRF_PREDICTOR.set_params(          **params )         # Set the parameters of this estimator.
aRF_PREDICTOR.transform(     X[, threshold] )         # Reduce X to its most important features.
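
For a Regressor, the .oob_score_ is simply the R^2 of the out-of-bag predictions against the training targets, which one can verify by hand ( assuming a predictor already fitted with oob_score = True ):

from sklearn.metrics import r2_score
r2_score( y, aRF_PREDICTOR.oob_prediction_ )                 # matches aRF_PREDICTOR.oob_score_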

One should also be aware that the default values do not serve best, much less serve well under all circumstances. One should pay attention to the problem domain so as to propose a reasonable set of ensemble parametrisations before moving further.
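
A minimal sketch of one such parametrisation search, using the .oob_score_ itself as the inner-CrossValidation-alike measure ( X, y are hypothetical stand-ins, and the candidate values are illustrative only, not recommendations ):

from sklearn.ensemble import RandomForestClassifier

for a_max_features in ( 'sqrt', 0.5, None ):                 # illustrative candidates only
    aRF_PREDICTOR = RandomForestClassifier( n_estimators = 500,
                                            max_features = a_max_features,
                                            oob_score    = True,
                                            random_state = 0,
                                            n_jobs       = -1
                                            )
    aRF_PREDICTOR.fit( X, y )                                # X, y: hypothetical prepared DataSET
    print( a_max_features, aRF_PREDICTOR.oob_score_ )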


Q: What is a good .oob_score_ ?

A: .oob_score_ is RANDOM! . . . Yes, it MUST ( be random )

While this may sound like a provocative epilogue, do not throw your hopes away. The RandomForest ensemble is a great tool. Some problems may come with categorical values in the features ( DataSET X ), however the costs of processing the ensemble remain adequate once you need not struggle with either bias or overfitting. That's great, isn't it?

Due to the need to be able to reproduce the same results on subsequent re-runs, it is a recommended practice to (re-)set both numpy.random and .set_params( random_state = ... ) to a known state before the RANDOM-PROCESS ( embedded into every bootstrapping of the RandomForest ensemble ). Doing that, one may observe a "de-noised" progression of the RandomForest-based predictor towards a better .oob_score_ that is due to truly improved predictive powers introduced by more ensemble members ( n_estimators ) and less constrained tree construction ( max_depth, max_leaf_nodes et al ), and not just due to "better luck" during the RANDOM-PROCESS of how to split the DataSET ...

Going closer towards better solutions typically involves more trees in the ensemble ( RandomForest decisions are based on a majority vote, so 10 estimators is not a big basis for making good decisions on highly complex DataSETs ). Numbers above 2000 are not uncommon. One may iterate over a range of sizings ( with the RANDOM-PROCESS kept under state-full control ) to demonstrate the ensemble "improvements", as the sketch below does.
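
Combining the two recommendations above — a state-full RANDOM-PROCESS plus a sweep over ensemble sizes — a hedged sketch might look like this ( X, y again hypothetical stand-ins for the prepared DataSET ):

import numpy
from sklearn.ensemble import RandomForestRegressor

numpy.random.seed( 42 )                                      # (re-)set the global RNG to a known state

for a_size in ( 10, 100, 1000, 2000 ):
    aRF_PREDICTOR = RandomForestRegressor( n_estimators = a_size,
                                           oob_score    = True,
                                           random_state = 42,  # keep the RANDOM-PROCESS reproducible
                                           n_jobs       = -1
                                           )
    aRF_PREDICTOR.fit( X, y )                                # X, y: hypothetical prepared DataSET
    print( a_size, aRF_PREDICTOR.oob_score_ )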

If the initial values of .oob_score_ fall somewhere around 0.51 - 0.53, your ensemble is just 1% - 3% better than a RANDOM-GUESS ( for a balanced two-class problem, where random guessing scores about 0.50 ).

Only after you have made your ensemble-based predictor into something better should you move on to additional tricks such as feature engineering et al.

aRF_PREDICTOR.oob_score_    Out[79]: 0.638801  # n_estimators =   10
aRF_PREDICTOR.oob_score_    Out[89]: 0.789612  # n_estimators =  100
