Why can I not find the lowest mean absolute error using Random Forest?


Problem description

I am working on a Kaggle competition with the following dataset: https://www.kaggle.com/c/home-data-for-ml-course/download/train.csv

According to the theory, as the number of estimators in a Random Forest model increases, the mean absolute error drops only up to some point (a sweet spot), and increasing it further causes overfitting. By plotting the number of estimators against the mean absolute error we should get this red curve, where the lowest point marks the best number of estimators.

I try to find the best number of estimators with the following code, but the plot shows that the MAE is constantly decreasing. What am I doing wrong?

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

train_data = pd.read_csv('train.csv')
y = train_data['SalePrice']
# for simplicity, drop all columns with missing values or non-numerical values
X = train_data.drop('SalePrice', axis=1).dropna(axis=1).select_dtypes(['number'])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mae_list = []
for n_estimators in range(10, 800, 10):
    rf_model = RandomForestRegressor(n_estimators=n_estimators, random_state=0, n_jobs=8)
    rf_model.fit(X_train, y_train)
    preds = rf_model.predict(X_test)
    mae = mean_absolute_error(y_test, preds)
    mae_list.append({'n_est': n_estimators, 'mae': mae})

# plotting the results
plt.plot([item['n_est'] for item in mae_list], [item['mae'] for item in mae_list])
plt.show()

Answer

You are not necessarily doing anything wrong.

Looking more closely at the theoretical curve you show, you'll notice that the horizontal axis does not contain the slightest indication of the actual number of trees/iterations at which such a minimum should happen. This is a rather general feature of such theoretical predictions: they tell you what to expect, but nothing about where exactly (or even roughly) to expect it.

Keeping this in mind, the only thing I can conclude from your second plot is that, in the specific range of ~800 trees you have tried, you are actually still to the left of the expected minimum.

Again, there is no theoretical prediction of how many trees (800, or 8,000, or...) you should add before reaching that minimum.
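If you do want to probe well beyond 800 trees, refitting a fresh forest at every step gets expensive. One possible shortcut (a sketch of mine, not part of the original answer, shown here on synthetic data so it is self-contained) is scikit-learn's `warm_start` option, which lets `fit` add new trees to the already-trained ensemble instead of starting over:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# synthetic stand-in for the Kaggle data, just for illustration
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# warm_start=True means each call to fit() after set_params(n_estimators=...)
# only trains the newly requested trees, keeping the existing ones.
rf = RandomForestRegressor(warm_start=True, random_state=0, n_jobs=-1)

mae_by_size = {}
for n in range(100, 1100, 100):
    rf.set_params(n_estimators=n)
    rf.fit(X_train, y_train)
    mae_by_size[n] = mean_absolute_error(y_test, rf.predict(X_test))
```

The resulting `mae_by_size` dictionary can be plotted exactly like the `mae_list` above, but the total training cost is that of a single 1,000-tree forest rather than the sum of all intermediate forests.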

To bring some empirical corroboration into the discussion: in my own first Kaggle competition, we kept adding trees until we reached ~24,000 of them, before our validation error started diverging (we were using GBM and not RF, but the rationale is identical).
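As a rough illustration of how one might actually locate such a minimum with a GBM (again a sketch on synthetic data, not the competition code): scikit-learn's `GradientBoostingRegressor` exposes `staged_predict`, which yields the prediction after every boosting iteration, so the whole validation curve comes from a single fit:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# noisy synthetic regression task, so overfitting can actually show up
X, y = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbm = GradientBoostingRegressor(n_estimators=2000, learning_rate=0.1, random_state=0)
gbm.fit(X_train, y_train)

# staged_predict iterates over the ensemble's predictions after each
# boosting stage, giving the full validation-MAE curve from one fit
val_mae = [mean_absolute_error(y_test, p) for p in gbm.staged_predict(X_test)]
best_iter = int(np.argmin(val_mae)) + 1  # iteration with the lowest validation MAE
```

Plotting `val_mae` against the iteration number typically shows the U-shape (or at least the flattening) from the theoretical curve, and `best_iter` marks where the validation error bottoms out.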
