How to tune parameters in Random Forest, using Scikit Learn?


Question

class sklearn.ensemble.RandomForestClassifier(n_estimators=10,
                                              criterion='gini', 
                                              max_depth=None,
                                              min_samples_split=2,
                                              min_samples_leaf=1, 
                                              min_weight_fraction_leaf=0.0, 
                                              max_features='auto', 
                                              max_leaf_nodes=None, 
                                              bootstrap=True, 
                                              oob_score=False,
                                              n_jobs=1, 
                                              random_state=None,
                                              verbose=0, 
                                              warm_start=False, 
                                              class_weight=None)

I'm using a random forest model with 9 samples and about 7000 attributes. Of these samples, there are 3 categories that my classifier recognizes.

I know this is far from ideal conditions but I'm trying to figure out which attributes are the most important in feature predictions. Which parameters would be the best to tweak for optimizing feature importance?

I tried different n_estimators and noticed that the amount of "significant features" (i.e. nonzero values in the feature_importances_ array) increased dramatically.
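
For illustration, here is a minimal sketch of that experiment: fit a forest several times with different n_estimators and count the nonzero entries in feature_importances_. The X and y arrays are synthetic placeholders with the shape described in the question (9 samples, ~7000 attributes, 3 classes), not real data.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder data matching the question's dimensions: 9 samples,
# ~7000 attributes, 3 classes with 3 samples each.
rng = np.random.RandomState(0)
X = rng.rand(9, 7000)
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

for n in (10, 100, 500, 1000):
    clf = RandomForestClassifier(n_estimators=n, random_state=0)
    clf.fit(X, y)
    # Count how many features received a nonzero importance.
    print(n, "trees ->", np.count_nonzero(clf.feature_importances_), "nonzero importances")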

I've read through the documentation but if anyone has any experience in this, I would like to know which parameters are the best to tune and a brief explanation why.

Answer

From my experience, there are three features worth exploring with the sklearn RandomForestClassifier, in order of importance:

  • n_estimators
  • max_features
  • criterion

n_estimators is not really worth optimizing. The more estimators you give it, the better it will do. 500 or 1000 is usually sufficient.

max_features is worth exploring for many different values. It may have a large impact on the behavior of the RF because it decides how many features each tree in the RF considers at each split.

criterion may have a small impact, but usually the default is fine. If you have the time, try it out.

Make sure to use sklearn's GridSearch (preferably GridSearchCV, but your data set size is too small) when trying out these parameters.
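
A minimal GridSearchCV sketch over max_features and criterion, with n_estimators fixed at a large value as suggested above. The candidate values in param_grid and cv=3 are assumptions chosen to fit the tiny data set, and X and y are the same synthetic placeholders as in the earlier sketch; the import path shown is the one used by current scikit-learn (sklearn.model_selection).

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Same placeholder data as in the earlier sketch.
rng = np.random.RandomState(0)
X = rng.rand(9, 7000)
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# Candidate values to explore; adjust to your own data.
param_grid = {
    "max_features": ["sqrt", "log2", 0.1, 0.5],
    "criterion": ["gini", "entropy"],
}

# cv=3 keeps one sample of each class in every fold, which is about
# as much cross-validation as 9 samples allow.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=500, random_state=0),
    param_grid,
    cv=3,
)
search.fit(X, y)
print(search.best_params_)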

If I understand your question correctly, though, you only have 9 samples and 3 classes? Presumably 3 samples per class? It's very, very likely that your RF is going to overfit with that little amount of data, unless they are good, representative records.
