Suboptimal Early Stopping prevents overfitting in Machine Learning?


Problem description


I have been using the early stopping feature of xgboost for a variety of problems, mostly classification. But I have made the following observations when working on a couple of datasets from different domains:

  • At the point of minimum evaluation error, where the difference between the train and test (the set used for evaluation to stop training rounds) errors is relatively high, the model seems to behave as if it has been over-fitted.

  • In such situations, when I instead stop the training rounds at the point where the train and test (evaluation data during training) errors are reasonably similar (though the evaluation error is not at its minimum), the models perform better and in line with the estimated error terms.

Therefore the question is: should training be stopped much earlier than at the optimal point (where the divergence between train and test (eval) errors is very high, even though the evaluation error is lower)?

Please assume that every care has been taken to correctly split the datasets for train, test, validation, etc.

Thanks.

Solution

Early stopping in xgboost works as follows (a minimal sketch is shown after this list):

  • It looks at the last tuple of your "watchlist" (usually you put the validation/testing set there)
  • It evaluates this set with your evaluation metric
  • If this evaluation hasn't improved for x rounds (where x = early_stopping_rounds)
  • The model stops training and records the best iteration (the one with the best evaluation on your test/validation set)
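
To make that concrete, here is a minimal sketch of wiring up such a watchlist. The toy dataset, parameter values, and variable names are my own assumptions for illustration, not from the original question:

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy binary-classification data, assumed purely for illustration
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.2, random_state=42)

dtrain = xgb.DMatrix(X_train, label=y_train)
deval = xgb.DMatrix(X_eval, label=y_eval)

# Early stopping watches the LAST tuple in `evals` -- here (deval, 'eval')
evals_result = {}
clf = xgb.train(
    {'objective': 'binary:logistic', 'eval_metric': 'error'},
    dtrain,
    num_boost_round=1000,
    evals=[(dtrain, 'train'), (deval, 'eval')],
    early_stopping_rounds=10,   # the x described above
    evals_result=evals_result,  # records per-round train/eval metrics
)
print(clf.best_iteration, clf.best_score)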

Yes, your model will be built with x unnecessary iterations (boosters). But assuming you have a trained xgboost.Booster in clf:

# Number of trees up to (and including) the best iteration
best_iteration = clf.best_ntree_limit

# Will predict using only the boosters up to the best iteration
y_pred = clf.predict(dtest, ntree_limit=best_iteration)
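
If you also want to inspect the train-versus-eval gap the question worries about, you can read it off the recorded metric history. This sketch assumes the evals_result dict from the training example above; it is my addition, not part of the original answer:

# Per-round error curves recorded during training
train_err = evals_result['train']['error']
eval_err = evals_result['eval']['error']

# Gap between eval and train error at the best iteration
best = clf.best_iteration
print('gap at best iteration:', eval_err[best] - train_err[best])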

So, the answer to your question is no.
