Fitting in nested cross-validation with cross_val_score with pipeline and GridSearch
Problem description
I am working with scikit-learn and trying to tune an XGBoost classifier. I attempted nested cross-validation: a pipeline rescales the training folds (to avoid data leakage and overfitting), GridSearchCV tunes the parameters, and cross_val_score produces the final roc_auc score.
from imblearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RepeatedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# Scaling is a pipeline step, so it is re-fitted on each training fold only.
steps = [('std_scaling', StandardScaler()), ('algo', XGBClassifier())]
pipeline = Pipeline(steps)

parameters = {'algo__min_child_weight': [1, 2],
              'algo__subsample': [0.6, 0.9],
              'algo__max_depth': [4, 6],
              'algo__gamma': [0.1, 0.2],
              'algo__learning_rate': [0.05, 0.5, 0.3]}

# Inner CV: used by GridSearchCV to select the hyperparameters.
inner_cv = RepeatedKFold(n_splits=2, n_repeats=5, random_state=15)
clf_auc = GridSearchCV(pipeline, cv=inner_cv, param_grid=parameters,
                       scoring='roc_auc', n_jobs=-1, return_train_score=False)

# Outer CV: scores the whole model selection procedure.
outer_cv = RepeatedKFold(n_splits=2, n_repeats=5, random_state=15)
outer_clf_auc = cross_val_score(clf_auc, X_train, y_train, cv=outer_cv, scoring='roc_auc')
Question 1. How do I fit cross_val_score to the training data?
Question 2. Since I included StandardScaler() in the pipeline, does it make sense to pass X_train to cross_val_score, or should I use a standardized form of X_train (i.e. std_X_train)?
# The alternative being asked about: standardizing manually, outside the pipeline.
std_scaler = StandardScaler().fit(X_train)
std_X_train = std_scaler.transform(X_train)
std_X_test = std_scaler.transform(X_test)
Answer
You chose the right way to avoid data leakage: nested CV.
In nested CV, what you estimate is not the score of a real estimator you can "hold in your hand", but of a non-existent "meta-estimator" that also encapsulates your model selection process.
That is, in every round of the outer cross-validation (in your case, cross_val_score), the estimator clf_auc runs an inner CV that selects the best model using only the training portion of that outer fold. Therefore, for every fold of the outer CV you are scoring a possibly different estimator chosen by the inner CV.
For example, in one outer CV fold the model scored may be one for which algo__min_child_weight was selected to be 1, and in another fold one for which it was selected to be 2.
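To see which parameters each outer fold ended up with, one option is cross_validate with return_estimator=True, which keeps the fitted GridSearchCV from every outer fold. This is only a minimal sketch: the synthetic data, the LogisticRegression stand-in for XGBClassifier, and the algo__C grid are assumptions made so the example runs with scikit-learn alone.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and a LogisticRegression stand-in (assumptions, not the asker's setup).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipeline = Pipeline([('std_scaling', StandardScaler()),
                     ('algo', LogisticRegression(max_iter=1000))])
param_grid = {'algo__C': [0.01, 0.1, 1.0, 10.0]}

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)
search = GridSearchCV(pipeline, param_grid=param_grid, cv=inner_cv, scoring='roc_auc')

# return_estimator=True keeps the fitted GridSearchCV from each outer fold.
results = cross_validate(search, X, y, cv=outer_cv, scoring='roc_auc',
                         return_estimator=True)

# Each outer fold may have selected a different value of algo__C.
for fold, (fitted_search, score) in enumerate(zip(results['estimator'],
                                                  results['test_score'])):
    print(fold, fitted_search.best_params_, round(score, 3))
```

The printed parameters need not agree across folds, which is exactly why the outer score describes the selection procedure and not one specific model.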
The score of the outer CV therefore represents a higher-level score: "under a reasonable model selection process, how well will my selected model generalize?"
Now, if you want to finish the process with a real model in hand, you have to select it yourself (cross_val_score will not do that for you).
The way to do that is to fit your inner model (the GridSearchCV object) on the entire data, that is:
clf_auc.fit(X, y)
At this point it is worth spelling out what you have:
- You have a model you can use, fitted on all the data available.
- When you are asked "how well does that model generalize to new data?", the answer is the score you got from your nested CV, which captured the model selection process as part of your model's scoring.
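The two steps described above (score the selection procedure, then fit once on everything) could be sketched as follows. Again the synthetic data, the LogisticRegression stand-in, and the names reported_auc and final_model are assumptions for illustration only. Note that cross_val_score itself is never "fitted": it fits clones of clf_auc on each outer training fold and returns only the scores.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and a LogisticRegression stand-in (assumptions, not the asker's setup).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipeline = Pipeline([('std_scaling', StandardScaler()),
                     ('algo', LogisticRegression(max_iter=1000))])
clf_auc = GridSearchCV(pipeline, param_grid={'algo__C': [0.1, 1.0, 10.0]},
                       cv=KFold(n_splits=3, shuffle=True, random_state=0),
                       scoring='roc_auc')

# Step 1: nested CV gives the generalization estimate to report.
# cross_val_score works on clones, so clf_auc itself stays unfitted here.
outer_scores = cross_val_score(clf_auc, X, y,
                               cv=KFold(n_splits=4, shuffle=True, random_state=1),
                               scoring='roc_auc')
reported_auc = outer_scores.mean()

# Step 2: fit the same search on all the data to get the model you will use.
clf_auc.fit(X, y)
final_model = clf_auc.best_estimator_  # pipeline refit on all of X, y (refit=True is the default)
predictions = final_model.predict_proba(X[:5])[:, 1]
print(reported_auc, clf_auc.best_params_)
```

Here reported_auc is the number to quote for expected performance, while final_model is the object to deploy; clf_auc.best_score_ from the final fit is an inner-CV score and tends to be optimistic.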
Regarding Question 2: if the scaler is part of the pipeline, there is no reason to scale X_train externally; pass the raw X_train to cross_val_score.
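A quick way to convince yourself that the pipeline handles this correctly is to fit it on a single training fold and inspect the scaler's learned mean. The sketch below assumes synthetic data and a LogisticRegression stand-in; the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and a LogisticRegression stand-in (assumptions, not the asker's setup).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
pipeline = Pipeline([('std_scaling', StandardScaler()),
                     ('algo', LogisticRegression(max_iter=1000))])

# Take one train/test split, as a CV routine would do internally.
train_idx, test_idx = next(KFold(n_splits=4, shuffle=True, random_state=0).split(X))
pipeline.fit(X[train_idx], y[train_idx])

# The scaler inside the pipeline has only seen the training fold.
fold_mean = pipeline.named_steps['std_scaling'].mean_
uses_train_fold_only = np.allclose(fold_mean, X[train_idx].mean(axis=0))
matches_full_data = np.allclose(fold_mean, X.mean(axis=0))
print(uses_train_fold_only, matches_full_data)
```

By contrast, calling StandardScaler().fit(X_train) before cross-validation computes the statistics from every row, including rows that later serve as validation folds, which is the leakage the pipeline is meant to prevent.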