How to choose n_estimators in RandomForestClassifier?

Question

I'm building a Random Forest binary classifier in Python on a pre-processed dataset with 4898 instances, a 60-40 stratified train/test split, and 78% of the data belonging to one target label and the rest to the other. What value of n_estimators should I choose to get the most practically useful random forest classifier? I plotted the accuracy vs. n_estimators curve using the code snippet below. x_train and y_train are the features and target labels of the training set, and x_test and y_test are the features and target labels of the test set.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
scores =[]
for k in range(1, 200):
    rfc = RandomForestClassifier(n_estimators=k)
    rfc.fit(x_train, y_train)
    y_pred = rfc.predict(x_test)
    scores.append(accuracy_score(y_test, y_pred))

import matplotlib.pyplot as plt
%matplotlib inline

# plot the relationship between n_estimators and testing accuracy
plt.plot(range(1, 200), scores)
plt.xlabel('Value of n_estimators for Random Forest Classifier')
plt.ylabel('Testing Accuracy')

Here, it is visible that a high value of n_estimators gives a good accuracy score, but the curve fluctuates randomly even between nearby values of n_estimators, so I can't pick the best one precisely. I only want to know about tuning the n_estimators hyperparameter: how should I choose it? Should I use a ROC or CAP curve instead of accuracy_score? Thanks.

Recommended answer

It is natural that a random forest stabilizes after some number of n_estimators, because, unlike boosting, it has no mechanism to "slow down" the fitting. Since there is little benefit to adding more tree estimators past that point, you can choose around 50.
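
One way to check where the forest stabilizes on your own data, rather than taking 50 on faith, is to grow a single forest incrementally and watch its out-of-bag (OOB) accuracy. The sketch below is only an illustration: the synthetic dataset from make_classification stands in for the asker's data, and the helper name oob_accuracy_curve and the list of tree counts are assumptions, not part of the original question or answer.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real data (assumption): binary target, ~78/22 class balance.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.78, 0.22], random_state=0,
)

def oob_accuracy_curve(X, y, tree_counts, random_state=0):
    """Return {n_estimators: OOB accuracy}, growing one forest step by step.

    tree_counts must be increasing: with warm_start=True each fit() only
    adds the missing trees to the existing forest.
    """
    rfc = RandomForestClassifier(
        warm_start=True,   # keep already-fitted trees between fit() calls
        oob_score=True,    # score each sample with trees that did not see it
        random_state=random_state,
    )
    curve = {}
    for n in tree_counts:
        rfc.set_params(n_estimators=n)
        rfc.fit(X, y)
        curve[n] = rfc.oob_score_
    return curve

curve = oob_accuracy_curve(X, y, tree_counts=[25, 50, 100, 200])
for n, score in curve.items():
    print(f"n_estimators={n:4d}  OOB accuracy={score:.3f}")
```

Because every point on this curve comes from the same growing forest and a fixed random_state, it is not affected by the seed-to-seed noise that comes from refitting a fresh, unseeded forest for every value. If the OOB accuracy stops changing meaningfully after some count, a value near that point is a reasonable choice. Note that with 78% of samples in one class, plain accuracy has a 78% baseline, so a class-aware metric may be worth tracking as well.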
