Grid Search and Early Stopping Using Cross Validation with XGBoost in SciKit-Learn
Question
I am fairly new to scikit-learn and have been trying to hyperparameter-tune XGBoost. My aim is to use early stopping and grid search to tune the model parameters, and to use early stopping to control the number of trees and avoid overfitting.
As I am using cross validation for the grid search, I was hoping to also use cross-validation in the early stopping criteria. The code I have so far looks like this:
import numpy as np
import pandas as pd
from sklearn import model_selection
import xgboost as xgb
# Import training and test data
train = pd.read_csv("train.csv").fillna(value=-999.0)
test = pd.read_csv("test.csv").fillna(value=-999.0)

# Encode variables
y_train = train.price_doc
x_train = train.drop(["id", "timestamp", "price_doc"], axis=1)

# XGBoost - sklearn method
gbm = xgb.XGBRegressor()

xgb_params = {
    'learning_rate': [0.01, 0.1],
    'n_estimators': [2000],
    'max_depth': [3, 5, 7, 9],
    'gamma': [0, 1],
    'subsample': [0.7, 1],
    'colsample_bytree': [0.7, 1]
}

fit_params = {
    'early_stopping_rounds': 30,
    'eval_metric': 'mae',
    'eval_set': [[x_train, y_train]]
}

grid = model_selection.GridSearchCV(gbm, xgb_params, cv=5,
                                    fit_params=fit_params)
grid.fit(x_train, y_train)
The problem I am having is the 'eval_set' parameter. I understand that this takes the predictor and response variables, but I am not sure how I can use the cross-validation data as the early stopping criterion.
Does anyone know how to overcome this problem? Thanks.
Answer
You could pass your early_stopping_rounds and eval_set as extra fit_params to GridSearchCV, and that would actually work. However, GridSearchCV will not change the fit_params between the different folds, so you would end up using the same eval_set in all the folds, which might not be what you mean by CV.
model = xgb.XGBClassifier()
clf = GridSearchCV(model, parameters,
                   fit_params={'early_stopping_rounds': 20,
                               'eval_set': [(X, y)]},
                   cv=kfold)
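Since that approach reuses the same eval_set in every fold, a sketch of the alternative — scoring each fold's held-out data after every boosting stage and picking the best round per fold — might look like the following. It uses scikit-learn's GradientBoostingRegressor with staged_predict as a stand-in for XGBoost (the mechanics carry over); the synthetic dataset, fold count, and stage count are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Illustrative data in place of the train.csv features
X, y = make_regression(n_samples=200, n_features=5, random_state=0)

best_rounds = []
for train_idx, val_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(X):
    model = GradientBoostingRegressor(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    # MAE on this fold's held-out data after each boosting stage,
    # i.e. the held-out fold acts as the eval_set for that fold
    maes = [mean_absolute_error(y[val_idx], pred)
            for pred in model.staged_predict(X[val_idx])]
    best_rounds.append(int(np.argmin(maes)) + 1)

# Aggregate the per-fold best iteration into one setting
n_rounds = int(np.mean(best_rounds))
print(best_rounds, n_rounds)
```

The per-fold best iterations can then be averaged (as above) or inspected individually before refitting a final model on the full training set.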
After some tweaking, I found the safest way to integrate early_stopping_rounds with the sklearn API is to implement an early-stopping mechanism yourself. You can do this by running GridSearchCV with n_rounds as a parameter to be tuned, then watching the mean_validation_score for the different models as n_rounds increases. You can then define a custom heuristic for early stopping. It won't save the computational time needed to evaluate all the possible n_rounds values, though.
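The custom heuristic itself can be as simple as stopping once the mean validation score has failed to improve for a given number of consecutive n_rounds settings. A minimal sketch of that idea, where the function name, patience value, and example scores are all illustrative assumptions:

```python
def pick_n_rounds(mean_scores, patience=3):
    """Return the 1-based position of the best mean validation score,
    scanning in order of increasing n_rounds and stopping once `patience`
    consecutive settings fail to improve on the best score seen so far."""
    best_idx, best_score, stale = 0, float("-inf"), 0
    for i, score in enumerate(mean_scores):
        if score > best_score:
            best_idx, best_score, stale = i, score, 0
        else:
            stale += 1
            if stale >= patience:
                break  # early stop: no improvement for `patience` settings
    return best_idx + 1

# Example: mean CV scores for n_rounds = 1, 2, 3, ...
scores = [0.60, 0.68, 0.71, 0.70, 0.705, 0.69, 0.66]
print(pick_n_rounds(scores))  # → 3
```

In a real run, mean_scores would come from GridSearchCV's cv_results_['mean_test_score'] for the models with increasing n_rounds.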
I think it is also a better approach than using a single hold-out split for this purpose.