Grid Search and Early Stopping Using Cross Validation with XGBoost in SciKit-Learn


Question

I am fairly new to scikit-learn and have been trying to hyperparameter-tune XGBoost. My aim is to use grid search to tune the model parameters and early stopping to control the number of trees and avoid overfitting.

As I am using cross-validation for the grid search, I was hoping to also use cross-validation in the early stopping criteria. The code I have so far looks like this:

import numpy as np
import pandas as pd
from sklearn import model_selection
import xgboost as xgb

#Import training and test data
train = pd.read_csv("train.csv").fillna(value=-999.0)
test = pd.read_csv("test.csv").fillna(value=-999.0)

# Encode variables
y_train = train.price_doc
x_train = train.drop(["id", "timestamp", "price_doc"], axis=1)

# XGBoost - sklearn method
gbm = xgb.XGBRegressor()

xgb_params = {
    'learning_rate': [0.01, 0.1],
    'n_estimators': [2000],
    'max_depth': [3, 5, 7, 9],
    'gamma': [0, 1],
    'subsample': [0.7, 1],
    'colsample_bytree': [0.7, 1]
}

fit_params = {
    'early_stopping_rounds': 30,
    'eval_metric': 'mae',
    'eval_set': [(x_train, y_train)]
}

grid = model_selection.GridSearchCV(gbm, xgb_params, cv=5,
                                    fit_params=fit_params)
grid.fit(x_train, y_train)

The problem I am having is the eval_set parameter. I understand that this wants the predictor and response variables, but I am not sure how I can use the cross-validation data as the early stopping criteria.

Does anyone know how to overcome this problem? Thanks.

Answer

You could pass early_stopping_rounds and eval_set as extra fit_params to GridSearchCV, and that would actually work. However, GridSearchCV will not change the fit_params between the different folds, so you would end up using the same eval_set in all the folds, which might not be what you mean by CV.

model = xgb.XGBClassifier()
clf = GridSearchCV(model, parameters,
                   fit_params={'early_stopping_rounds': 20,
                               'eval_set': [(X, y)]},
                   cv=kfold)
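
Note that newer scikit-learn releases removed the fit_params constructor argument from GridSearchCV; keyword arguments passed to GridSearchCV.fit are forwarded to each fold's fit call instead. A minimal sketch of the equivalent call, assuming an xgboost version in which fit() still accepts early_stopping_rounds (recent releases moved it to the estimator constructor) and reusing the placeholder names X, y, parameters and kfold from the snippet above:

model = xgb.XGBClassifier()
clf = GridSearchCV(model, parameters, cv=kfold)
# Fit keyword arguments are forwarded to XGBClassifier.fit on every fold,
# so the same fixed eval_set is still reused across all folds.
clf.fit(X, y,
        early_stopping_rounds=20,
        eval_set=[(X, y)])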

After some tweaking, I found the safest way to integrate early_stopping_rounds with the sklearn API is to implement an early-stopping mechanism yourself. You can do it by running a GridSearchCV with n_estimators (the number of boosting rounds) as the parameter to be tuned. You can then watch the mean validation score for the different models with increasing n_estimators, and define a custom heuristic for stopping early. It won't save the computational time needed to evaluate all the possible values of n_estimators, though.

I think it is also a better approach than using a single split hold-out for this purpose.
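
A minimal sketch of that idea, reusing x_train and y_train from the question; the grid varies only n_estimators, and the fixed learning_rate and max_depth values are placeholders chosen for illustration:

import pandas as pd
import xgboost as xgb
from sklearn import model_selection

# Tune the number of trees explicitly instead of relying on early_stopping_rounds
param_grid = {'n_estimators': [100, 200, 400, 800, 1600]}
gbm = xgb.XGBRegressor(learning_rate=0.1, max_depth=5)

grid = model_selection.GridSearchCV(gbm, param_grid, cv=5,
                                    scoring='neg_mean_absolute_error')
grid.fit(x_train, y_train)

# Watch the mean cross-validated score as n_estimators grows; a custom
# "early stopping" rule is simply to stop adding trees once it plateaus.
results = pd.DataFrame(grid.cv_results_)
print(results[['param_n_estimators', 'mean_test_score']])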

