How to choose n_estimators in RandomForestClassifier?
Question
I'm building a random forest binary classifier in Python on a pre-processed dataset with 4898 instances, a 60-40 stratified train/test split, and 78% of the data belonging to one target label and the rest to the other. What value of n_estimators should I choose to get the most practically useful random forest classifier? I plotted the accuracy vs. n_estimators curve using the code snippet below. x_train and y_train are the features and target labels of the training set, and x_test and y_test are the features and target labels of the test set.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Jupyter/IPython only; remove the next line in a plain script
%matplotlib inline

# fit one forest per value of n_estimators and record its test accuracy
n_values = range(1, 200)
scores = []
for k in n_values:
    rfc = RandomForestClassifier(n_estimators=k)
    rfc.fit(x_train, y_train)
    y_pred = rfc.predict(x_test)
    scores.append(accuracy_score(y_test, y_pred))

# plot the relationship between n_estimators and testing accuracy
plt.plot(n_values, scores)
plt.xlabel('Value of n_estimators for Random Forest Classifier')
plt.ylabel('Testing Accuracy')
plt.show()
Here, it is visible that a high value of n_estimators gives a good accuracy score, but the curve fluctuates randomly even between nearby values of n_estimators, so I can't pick the best one precisely. I only want to know about tuning the n_estimators hyperparameter: how should I choose it? Should I use a ROC or CAP curve instead of accuracy_score? Thanks.
Recommended answer
It is natural that a random forest stabilizes after some number of n_estimators (because, unlike boosting, there is no mechanism to "slow down" the fitting). Since there is no benefit to adding more weak tree estimators beyond that point, you can choose around 50.
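One way to check where the forest stabilizes on your own data, rather than taking 50 on trust, is to watch the out-of-bag (OOB) error as trees are added. The sketch below is only an illustration under stated assumptions: the data is a synthetic stand-in generated with make_classification (4898 rows, roughly 78% in one class), the grid of tree counts is arbitrary, and random_state is fixed so that the run-to-run randomness of each fit does not show up as jitter between nearby values. It also reports ROC AUC on the held-out set, since with a 78/22 class balance a model that always predicts the majority label already reaches about 78% accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the asker's dataset (assumption, not the real data):
# 4898 rows with roughly 78% of them in the majority class.
X, y = make_classification(n_samples=4898, n_features=20,
                           weights=[0.78, 0.22], random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)

# warm_start=True keeps the trees already grown and only adds new ones on
# each call to fit; oob_score=True scores every sample using only the trees
# that did not see it during bootstrapping.
rfc = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)

tree_counts = list(range(25, 201, 25))  # illustrative grid, not a recommendation
oob_errors = []
for n in tree_counts:
    rfc.set_params(n_estimators=n)
    rfc.fit(x_train, y_train)
    oob_errors.append(1.0 - rfc.oob_score_)

# ROC AUC of the final (largest) forest on the held-out test set
test_auc = roc_auc_score(y_test, rfc.predict_proba(x_test)[:, 1])

for n, err in zip(tree_counts, oob_errors):
    print(f"n_estimators={n:3d}  OOB error={err:.4f}")
print(f"test ROC AUC at {tree_counts[-1]} trees: {test_auc:.4f}")
```

If the OOB error stops changing meaningfully past some tree count, that count is a reasonable choice; going higher mostly costs training and prediction time.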