XGBoost: cannot pass validation data for eval_set in pipeline


Problem description

I want to run GridSearchCV on an XGBoost model inside a pipeline. I have a preprocessor for the data, defined above this snippet, and some grid parameters:

XGBmodel = XGBRegressor(random_state=0)
pipe = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('XGBmodel', XGBmodel)
])

I want to pass these fit parameters:

fit_params = {"XGBmodel__eval_set": [(X_valid, y_valid)], 
              "XGBmodel__early_stopping_rounds": 10, 
              "XGBmodel__verbose": False}

I am trying to fit the model:

searchCV = GridSearchCV(pipe, cv=5, param_grid=param_grid, fit_params=fit_params)
searchCV.fit(X_train, y_train)

But I get an error on the line with eval_set: DataFrame.dtypes for data must be int, float or bool

I guess this is because the validation data does not go through the preprocessing step. But when I search online, this is how it is done everywhere, and it seems it should work. I also tried to find a way to apply the preprocessor to the validation data separately, but the validation data cannot be transformed unless the preprocessor has first been fit on the training data.

Full code

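from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBRegressor

columns = num_cols + cat_cols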
X_train = X_full_train[columns].copy()
X_valid = X_full_valid[columns].copy()

num_preprocessor = SimpleImputer(strategy = 'mean')
cat_preprocessor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy = 'most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', num_preprocessor, num_cols),
    ('cat', cat_preprocessor, cat_cols)
])

XGBmodel = XGBRegressor(random_state=0)
pipe = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('XGBmodel', XGBmodel)
])

param_grid = {
    "XGBmodel__n_estimators": [10, 50, 100, 500],
    "XGBmodel__learning_rate": [0.1, 0.5, 1],
}

fit_params = {"XGBmodel__eval_set": [(X_valid, y_valid)], 
              "XGBmodel__early_stopping_rounds": 10, 
              "XGBmodel__verbose": False}

searchCV = GridSearchCV(pipe, cv=5, param_grid=param_grid, fit_params=fit_params)
searchCV.fit(X_train, y_train)

Is there a way to preprocess the validation data inside the pipeline? Or is there a completely different way to implement this?

Recommended answer

There is no good way. If you have a long pipeline of transformers before the model, consider fitting the transformers in a pipeline and then fitting the model separately.

The underlying issue is that a pipeline has no notion of a validation set used during model fitting. You can see a discussion of this on the LightGBM GitHub. The proposal there is to pre-fit the transformers and apply them to the validation data before you fit the full pipeline. This is fine if the transformers are fast, but in an extreme case it can double the CPU time.
