How to tune parameters in Random Forest, using Scikit Learn?


Question

class sklearn.ensemble.RandomForestClassifier(n_estimators=10,
                                              criterion='gini', 
                                              max_depth=None,
                                              min_samples_split=2,
                                              min_samples_leaf=1, 
                                              min_weight_fraction_leaf=0.0, 
                                              max_features='auto', 
                                              max_leaf_nodes=None, 
                                              bootstrap=True, 
                                              oob_score=False,
                                              n_jobs=1, 
                                              random_state=None,
                                              verbose=0, 
                                              warm_start=False, 
                                              class_weight=None)

I'm using a random forest model with 9 samples and about 7000 attributes. Of these samples, there are 3 categories that my classifier recognizes.

I know this is far from ideal conditions, but I'm trying to figure out which attributes are the most important in feature predictions. Which parameters would be the best to tweak for optimizing feature importance?

I tried different n_estimators and noticed that the amount of "significant features" (i.e. nonzero values in the feature_importances_ array) increased dramatically.

I've read through the documentation, but if anyone has any experience in this, I would like to know which parameters are the best to tune and a brief explanation why.

Answer

From my experience, there are three features worth exploring with the sklearn RandomForestClassifier, in order of importance:

  • n_estimators

  • max_features

  • criterion

n_estimators is not really worth optimizing. The more estimators you give it, the better it will do. 500 or 1000 is usually sufficient.

max_features is worth exploring for many different values. It may have a large impact on the behavior of the RF because it decides how many features each tree in the RF considers at each split.

criterion may have a small impact, but usually the default is fine. If you have the time, try it out.

Make sure to use sklearn's GridSearch (preferably GridSearchCV, but your data set size is too small) when trying out these parameters.
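A minimal sketch of the GridSearchCV pattern suggested above, using a synthetic dataset (via make_classification) that is large enough for cross-validation, unlike the 9-sample set in the question. The grid values and n_estimators=200 (kept smaller than the 500–1000 recommended above, for speed) are illustrative choices, not prescriptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class problem, big enough for 3-fold CV.
X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           n_classes=3, random_state=0)

# Tune the two parameters the answer says matter: max_features and criterion.
param_grid = {
    "max_features": ["sqrt", "log2", 0.5],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    param_grid, cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

search.best_params_ then holds the winning combination, and search.best_estimator_ is a forest refit on the full data with those settings.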

If I understand your question correctly, though, you only have 9 samples and 3 classes? Presumably 3 samples per class? It's very, very likely that your RF is going to overfit with that little amount of data, unless they are good, representative records.
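One way to sanity-check that overfitting concern on such a tiny dataset is leave-one-out cross-validation: train on 8 samples, predict the 9th, repeat. The data below is random noise standing in for the questioner's (unavailable) data, so scores near chance (~1/3) are the expected outcome; a real, representative dataset should do noticeably better:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(9, 7000)                       # stand-in for the 9x7000 data
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])   # 3 samples per class

rf = RandomForestClassifier(n_estimators=500, random_state=0)
# One score per held-out sample; the mean estimates generalization accuracy.
scores = cross_val_score(rf, X, y, cv=LeaveOneOut())
print(scores, scores.mean())
```

If the leave-one-out mean sits at chance level while training accuracy is perfect, the feature importances reported by the fit forest should be treated with heavy skepticism.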
