Using SMOTE with GridSearchCV in Scikit-learn


Problem description

I'm dealing with an imbalanced dataset and want to run a grid search to tune my model's parameters using scikit-learn's GridSearchCV. To oversample the data I want to use SMOTE, and I know I can include it as a stage of a pipeline and pass that pipeline to GridSearchCV. My concern is that SMOTE will then be applied to both the training and validation folds, which is not what you are supposed to do: the validation set should not be oversampled. Am I right that the whole pipeline will be applied to both dataset splits? And if so, how can I work around this? Thanks a lot in advance.

Recommended answer

Yes, it can be done, but you need to use the imblearn Pipeline rather than the scikit-learn one.

imblearn has its own Pipeline class that handles samplers correctly. I described this in an answer to a similar question.

When predict() is called on an imblearn.pipeline.Pipeline object, it skips the sampling step and passes the data unchanged to the next transformer. You can confirm this by looking at the source code:

        if hasattr(transform, "fit_sample"):
            pass
        else:
            Xt = transform.transform(Xt)
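
To make the effect of that check concrete, here is a toy pipeline in plain Python. It is only a sketch of the idea, not imblearn's actual implementation: ToyPipeline, DuplicateMinority, Scale and MajorityClassifier are hypothetical names made up for this illustration, and the sampler method is called fit_sample only to match the snippet above (as far as I know, newer imblearn releases call it fit_resample).

```python
class DuplicateMinority:
    """Toy sampler (hypothetical): repeats minority rows until classes balance."""

    def fit_sample(self, X, y):
        counts = {label: y.count(label) for label in set(y)}
        minority = min(counts, key=counts.get)
        deficit = max(counts.values()) - counts[minority]
        rows = [x for x, label in zip(X, y) if label == minority]
        extra = [rows[i % len(rows)] for i in range(deficit)]
        return X + extra, y + [minority] * deficit


class Scale:
    """Toy transformer (hypothetical): multiplies every value by a constant."""

    def __init__(self, factor):
        self.factor = factor

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[v * self.factor for v in row] for row in X]


class MajorityClassifier:
    """Toy estimator (hypothetical): records how many rows it was fitted on."""

    def fit(self, X, y):
        self.n_seen_ = len(X)
        self.label_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class ToyPipeline:
    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        Xt, yt = X, y
        for step in self.steps[:-1]:
            if hasattr(step, "fit_sample"):
                Xt, yt = step.fit_sample(Xt, yt)  # resample the training data only
            else:
                Xt = step.fit(Xt, yt).transform(Xt)
        self.steps[-1].fit(Xt, yt)
        return self

    def predict(self, X):
        Xt = X
        for step in self.steps[:-1]:
            if hasattr(step, "fit_sample"):
                pass  # samplers are skipped at prediction time
            else:
                Xt = step.transform(Xt)
        return self.steps[-1].predict(Xt)


X_train = [[1.0], [2.0], [3.0], [4.0], [10.0]]
y_train = [0, 0, 0, 0, 1]
X_valid = [[5.0], [11.0]]

clf = MajorityClassifier()
pipe = ToyPipeline([DuplicateMinority(), Scale(0.1), clf]).fit(X_train, y_train)
predictions = pipe.predict(X_valid)

print(clf.n_seen_)       # rows the classifier saw after oversampling
print(len(predictions))  # one prediction per validation row
```

The classifier is fitted on the oversampled training rows, while the validation rows pass through untouched, which is the behaviour you want inside each cross-validation fold.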

So for this to work correctly, you need something like the following:

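from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# SMOTE is applied only when fitting on each training fold; it is skipped at predict time
model = Pipeline([
        ('sampling', SMOTE()),
        ('classification', LogisticRegression())
    ])

# params is your parameter grid; its keys take the form '<step name>__<parameter>'
grid = GridSearchCV(model, params, ...)
grid.fit(X, y)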

Fill in the details as necessary, and the pipeline will take care of the rest.
