Using SMOTE with GridSearchCV in scikit-learn


Question

I'm dealing with an imbalanced dataset and want to run a grid search to tune my model's parameters using scikit-learn's GridSearchCV. To oversample the data I want to use SMOTE, and I know I can include it as a stage of a pipeline and pass that pipeline to GridSearchCV. My concern is that SMOTE would then be applied to both the training and the validation folds, which is not what you are supposed to do: the validation set should not be oversampled. Am I right that the whole pipeline will be applied to both splits? And if so, how can I work around this? Thanks a lot in advance.

Recommended answer

Yes, it can be done, but you need to use imblearn's Pipeline rather than scikit-learn's. imblearn has its own Pipeline that handles samplers correctly. I described this in a similar question here.

When predict() is called on an imblearn Pipeline object, the sampling step is skipped and the data is passed unchanged to the next transformer. Resampling therefore happens only while fitting, so the validation folds scored by GridSearchCV are not oversampled. You can confirm this by looking at the source code here (the excerpt checks for fit_sample, which newer imblearn releases call fit_resample):

        if hasattr(transform, "fit_sample"):
            pass
        else:
            Xt = transform.transform(Xt)
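To make the mechanism concrete, here is a toy, dependency-free sketch of a sampler-aware pipeline. It is not imblearn's actual implementation: `ToyPipeline`, `DuplicateMinority` and `MajorityVote` are hypothetical names invented for this illustration, the sampler repeats minority rows instead of running SMOTE, and the check uses `fit_resample` on the assumption that you are on a recent imblearn release.

```python
class ToyPipeline:
    """Toy stand-in for a sampler-aware pipeline (not imblearn's real code)."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, object); the last one is the estimator

    def fit(self, X, y):
        for _, step in self.steps[:-1]:
            if hasattr(step, "fit_resample"):
                # Samplers change both X and y, and only while fitting.
                X, y = step.fit_resample(X, y)
            else:
                X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            if hasattr(step, "fit_resample"):
                continue  # skip samplers: unseen data is never resampled
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


class DuplicateMinority:
    """Hypothetical sampler: repeats minority rows until classes are balanced."""

    def fit_resample(self, X, y):
        counts = {label: y.count(label) for label in set(y)}
        majority = max(counts.values())
        X_out, y_out = list(X), list(y)
        for label, n in counts.items():
            rows = [x for x, lab in zip(X, y) if lab == label]
            for i in range(majority - n):
                X_out.append(rows[i % len(rows)])
                y_out.append(label)
        return X_out, y_out


class MajorityVote:
    """Hypothetical estimator that records how many rows it was fitted on."""

    def fit(self, X, y):
        self.n_fit_rows = len(X)
        self.label_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


X_train, y_train = [[0], [1], [2], [3], [10]], [0, 0, 0, 0, 1]
X_val = [[4], [11]]

clf = MajorityVote()
pipe = ToyPipeline([("sampling", DuplicateMinority()), ("clf", clf)])
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_val)

print("rows seen during fit:", clf.n_fit_rows)    # training data was oversampled
print("rows predicted:", len(predictions))        # validation data was left alone
```

The estimator is fitted on more rows than were passed in, while the number of predictions matches the validation rows exactly, which is the behaviour the answer relies on.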

So for this to work correctly, you need the following:

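from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Use imblearn's Pipeline (not sklearn's) so SMOTE runs only during fit
model = Pipeline([
        ('sampling', SMOTE()),
        ('classification', LogisticRegression())
    ])

# params: your parameter grid, keyed as '<step name>__<parameter>'
grid = GridSearchCV(model, params, ...)
grid.fit(X, y)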

Fill in the details as necessary, and the pipeline will take care of the rest.
