Does oversampling happen before or after cross-validation using imblearn pipelines?

Question

I split my data into train/test sets before running cross-validation on the training data to validate my hyperparameters. My dataset is imbalanced and I want to perform SMOTE oversampling on each iteration, so I have set up a pipeline using imblearn.

My understanding is that oversampling should be done after dividing the data into k folds, to prevent information leakage. Is this order of operations (data split into k folds, k-1 folds oversampled, predictions made on the remaining fold) preserved when using Pipeline in the setup below?

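import numpy as np
import xgboost as xgb
from scipy import stats
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# SMOTE is a pipeline step, so it is re-fit on the training folds of each CV split.
model = Pipeline([
        ('sampling', SMOTE()),
        ('classification', xgb.XGBClassifier())
    ])


param_dist = {'classification__n_estimators': stats.randint(50, 500),
              'classification__learning_rate': stats.uniform(0.01, 0.3),
              'classification__subsample': stats.uniform(0.3, 0.6),
              'classification__max_depth': [3, 4, 5, 6, 7, 8, 9],
              'classification__colsample_bytree': stats.uniform(0.5, 0.5),
              'classification__min_child_weight': [1, 2, 3, 4],
              # Called 'ratio' before imblearn 0.4; later releases use 'sampling_strategy'.
              'sampling__sampling_strategy': np.linspace(0.25, 0.5, 10)
             }

# scorer_cv_cost_savings, X_train and y_train are defined elsewhere by the asker.
random_search = RandomizedSearchCV(model,
                                   param_dist,
                                   cv=StratifiedKFold(n_splits=5),
                                   n_iter=10,
                                   scoring=scorer_cv_cost_savings)
random_search.fit(X_train.values, y_train)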

Recommended answer

Your understanding is right. When you pass the pipeline as the model, each cross-validation split calls .fit() on the k-1 training folds only, and testing is done on the held-out kth fold. Because the sampler runs as part of .fit(), SMOTE is applied only to the training folds; the held-out fold is scored as-is, without resampling.

The documentation for imblearn.pipeline.Pipeline.fit() says:

Fit the model.

Fit all the transforms/samplers one after the other and transform/sample the data, then fit the transformed/sampled data using the final estimator.
