Correct way to do cross validation in a pipeline with imbalanced data


Problem Description

For the given imbalanced data, I have created separate pipelines for standardization and one-hot encoding:

numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[('ohe', OneHotCategoricalEncoder())])

After that, a ColumnTransformer combines the above pipelines into one:

from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

The final pipeline is as follows:

# imblearn's Pipeline is required so that SMOTE runs only on training folds
from imblearn.pipeline import Pipeline as pl1
from imblearn.over_sampling import SMOTE

smt = SMOTE(random_state=42)
rf = pl1([('preprocessor', preprocessor), ('smote', smt),
          ('classifier', RandomForestClassifier())])

I am doing the pipeline fit on imbalanced data, so I have included the SMOTE technique along with the pre-processing and classifier. As the data is imbalanced, I want to check the recall score.

Is the approach shown in the code below correct? I am getting recall around 0.98, which makes me suspect the model is overfitting. Any suggestions if I am making a mistake?

scores = cross_val_score(rf, X, y, cv=5, scoring="recall")


Recommended Answer

The important concern in imbalanced settings is to ensure that enough members of the minority class are present in each CV fold; thus, it would seem advisable to enforce that using StratifiedKFold, i.e.:

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)  
scores = cross_val_score(rf, X, y, cv=skf, scoring="recall")
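To see what stratification buys you, here is a minimal sketch on hypothetical 9:1 imbalanced labels: with StratifiedKFold every test fold preserves the class ratio, so each fold is guaranteed to contain minority samples.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 90 negatives, 10 positives (hypothetical data).
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))

skf = StratifiedKFold(n_splits=5)
# Count minority samples in each of the 5 test folds.
minority_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
print(minority_per_fold)  # → [2, 2, 2, 2, 2]
```

Every fold gets exactly 2 of the 10 positives, i.e. the 9:1 ratio is preserved; with a plain (unshuffled) KFold on labels sorted like these, some folds could contain no positives at all, making recall undefined.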

Nevertheless, it turns out that even when using cross_val_score as you do (i.e. simply with cv=5), scikit-learn takes care of this and indeed engages a stratified CV; from the docs:


cv : int, cross-validation generator or an iterable, default=None

  • None, to use the default 5-fold cross validation,

  • int, to specify the number of folds in a (Stratified)KFold.

For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
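You can confirm this documented behavior directly with `check_cv`, the helper scikit-learn uses internally to resolve the `cv` argument; the sketch below (on hypothetical binary labels) shows that an integer `cv` with a classifier resolves to StratifiedKFold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, check_cv

# Hypothetical binary target, 9:1 imbalanced.
y = np.array([0] * 90 + [1] * 10)

# With an integer cv, a binary/multiclass y, and classifier=True,
# check_cv returns a StratifiedKFold splitter.
cv = check_cv(5, y, classifier=True)
print(type(cv).__name__)  # → StratifiedKFold
```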

So, using your code as is:

scores = cross_val_score(rf, X, y, cv=5, scoring="recall")

is perfectly fine.
