Fitting in nested cross-validation with cross_val_score with pipeline and GridSearch


Problem description

I am working in scikit-learn and I am trying to tune my XGBoost. I attempted a nested cross-validation, using a pipeline to rescale the training folds (to avoid data leakage and overfitting), with GridSearchCV for parameter tuning and cross_val_score to get the roc_auc score at the end.

from imblearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier


# Scaler and classifier are wrapped in a pipeline so the scaler is re-fitted
# on the training part of every CV fold (no leakage into the validation part).
steps = [('std_scaling', StandardScaler()), ('algo', XGBClassifier())]

pipeline = Pipeline(steps)

parameters = {'algo__min_child_weight': [1, 2],
              'algo__subsample': [0.6, 0.9],
              'algo__max_depth': [4, 6],
              'algo__gamma': [0.1, 0.2],
              'algo__learning_rate': [0.05, 0.5, 0.3]}

# The same RepeatedKFold splitter is used for the inner (tuning) and outer (scoring) loop.
cv1 = RepeatedKFold(n_splits=2, n_repeats=5, random_state=15)

# Inner loop: hyper-parameter search over the pipeline.
clf_auc = GridSearchCV(pipeline, cv=cv1, param_grid=parameters,
                       scoring='roc_auc', n_jobs=-1, return_train_score=False)

# Outer loop: X_train / y_train are my training data; cross_val_score fits
# clf_auc on each outer training fold and scores it on the held-out fold.
outer_clf_auc = cross_val_score(clf_auc, X_train, y_train, cv=cv1, scoring='roc_auc')

Question 1. How do I fit cross_val_score to the training data?

Question 2. Since I included the StandardScaler() in the pipeline, does it make sense to pass X_train to cross_val_score, or should I use a standardized form of X_train (i.e. std_X_train)?

std_scaler = StandardScaler().fit(X_train)
std_X_train = std_scaler.transform(X_train)
std_X_test = std_scaler.transform(X_test)

Answer

You chose the right way to avoid data leakage, as you say: nested CV.

The thing is that in nested CV what you estimate is not the score of a real estimator you can "hold in your hand", but of a non-existing "meta-estimator" which also describes your model selection process.

Meaning: in every round of the outer cross-validation (in your case represented by cross_val_score), the estimator clf_auc undergoes an internal CV which selects the best model for the given fold of the external CV. Therefore, for every fold of the external CV you are scoring a different estimator chosen by the internal CV.

For example, in one external CV fold the model scored can be one that selected the param algo__min_child_weight to be 1, and in another a model that selected it to be 2.
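
If you want to see this in action, here is a minimal sketch (not part of the original answer, and assuming the clf_auc, cv1, X_train and y_train objects defined in the question) using sklearn's cross_validate with return_estimator=True, which hands back the GridSearchCV instance fitted on each outer fold:

from sklearn.model_selection import cross_validate

# Run the same nested CV, but keep the fitted GridSearchCV of every outer fold.
nested_results = cross_validate(clf_auc, X_train, y_train, cv=cv1,
                                scoring='roc_auc', return_estimator=True)

# The inner CV may have picked different hyper-parameters on different outer folds.
for i, fitted_search in enumerate(nested_results['estimator']):
    print('outer fold %d: %s' % (i, fitted_search.best_params_))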

The score of the external CV therefore represents a more high-level score: "under a reasonable model selection process, how well will my selected model generalize?"
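
The array returned by cross_val_score in the question is exactly that high-level estimate; as a small illustration (assuming the outer_clf_auc variable from the question), you would typically report its mean and spread:

# Summary of the nested-CV estimate of generalization performance.
print('nested CV ROC AUC: %.3f +/- %.3f' % (outer_clf_auc.mean(), outer_clf_auc.std()))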

Now, if you want to finish the process with a real model in hand, you have to select it in some way (cross_val_score will not do that for you).

The way to do that is to fit your internal model over the entire data, meaning to perform:

clf_auc.fit(X, y)

Now is the moment to realize what you have done here:

  1. You have a model you can use, fitted over all the available data.
  2. When you are asked "how well will that model generalize on new data?", the answer is the score you got in your nested CV, since it captured the model selection process as part of the model's scoring.
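
For illustration, a hedged sketch of what this leaves you with (best_params_, best_estimator_ and predict_proba are standard GridSearchCV / pipeline attributes; X_new is a hypothetical batch of unseen data, not from the question):

# After clf_auc.fit(X, y), the grid search exposes the selected, fully fitted model.
print(clf_auc.best_params_)              # hyper-parameters chosen by the inner CV on all data
final_model = clf_auc.best_estimator_    # the fitted Pipeline (StandardScaler + XGBClassifier)
new_scores = final_model.predict_proba(X_new)[:, 1]   # scores for hypothetical unseen data X_new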

And regarding Question 2: if the scaler is part of the pipeline, there is no reason to manipulate X_train externally.
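
To see why, here is a minimal sketch (not part of the original answer) of what each CV fold effectively does when the raw X_train is passed in: the pipeline fits the scaler on that fold's training rows only and applies the same statistics to the held-out rows. A single manual split makes this explicit (train_test_split and roc_auc_score are standard sklearn utilities; pipeline, X_train and y_train are the objects from the question):

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# One manual split, mimicking what every CV fold does internally.
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, random_state=15)
pipeline.fit(X_tr, y_tr)                                   # StandardScaler is fitted on X_tr only
val_proba = pipeline.predict_proba(X_val)[:, 1]            # X_val is scaled with X_tr statistics
print('single-split ROC AUC: %.3f' % roc_auc_score(y_val, val_proba))

So passing the raw X_train to cross_val_score is the right choice; the pre-standardized std_X_train would leak the validation-fold statistics into the scaler.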
