Use sklearn's GridSearchCV with a pipeline, preprocessing just once


Question

I'm using scikit-learn to tune a model's hyperparameters. I'm using a pipeline to chain the preprocessing with the estimator. A simple version of my problem looks like this:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression


grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression()),
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)

_ = grid.fit(X=np.random.rand(10, 3),
             y=np.random.randint(2, size=(10,)))

In my case the preprocessing (StandardScaler() in the toy example) is time-consuming, and I'm not tuning any of its parameters.

So, when I execute the example, the StandardScaler is executed 8 times: 2 (fit on the training fold, transform on the test fold) * 2 CV folds * 2 values of C. But every time StandardScaler is executed for a different value of C on the same fold, it returns the same output, so it would be much more efficient to compute it once and then run only the estimator part of the pipeline.
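The repeated work can be made visible with a small counting wrapper. This is only a sketch: `CountingScaler` is a hypothetical helper (not part of scikit-learn) that subclasses `StandardScaler` and tallies calls in a class-level dictionary, and it assumes the search runs sequentially (the default `n_jobs`), so all cloned scalers share that counter. The labels are fixed rather than random so that every fold contains both classes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class CountingScaler(StandardScaler):
    """Hypothetical helper: a StandardScaler that counts its own calls."""

    # Class-level counter, shared by every clone GridSearchCV creates.
    calls = {"fit": 0, "transform": 0}

    def fit(self, X, y=None, **kwargs):
        CountingScaler.calls["fit"] += 1
        return super().fit(X, y, **kwargs)

    def transform(self, X, copy=None):
        CountingScaler.calls["transform"] += 1
        return super().transform(X, copy=copy)


rng = np.random.RandomState(0)
X = rng.rand(10, 3)
y = np.array([0, 1] * 5)  # both classes appear in each of the 2 folds

grid = GridSearchCV(make_pipeline(CountingScaler(), LogisticRegression()),
                    param_grid={"logisticregression__C": [0.1, 10.0]},
                    cv=2,
                    refit=False)
grid.fit(X, y)

# One fit per (fold, C) pair; each pair transforms the training fold
# and then the test fold.
print(CountingScaler.calls)
```

With two folds and two values of C, the scaler is fitted once per (fold, C) pair even though its result depends only on the fold.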

I could manually split the pipeline into the preprocessing (no hyperparameters tuned) and the estimator. But the preprocessing must be fitted on the training set only, so I would have to implement the splits manually and not use GridSearchCV at all.

Is there a simple/standard way to avoid repeating the preprocessing while using GridSearchCV?

Recommended answer

Update: Ideally, the answer below should not be used, as it leads to data leakage, as discussed in the comments. In this answer, GridSearchCV tunes the hyperparameters on data that StandardScaler has already preprocessed using all of the samples, including those that later serve as validation folds, which is not correct. In most cases this should not matter much, but algorithms that are very sensitive to scaling will give wrong results.

Essentially, GridSearchCV is also an estimator: it implements the fit() and predict() methods that the pipeline uses.

So instead of:

grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression()),
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)

Do this:

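# The scaler is fitted once on all the data passed to clf.fit();
# only LogisticRegression is refitted inside the grid search.
clf = make_pipeline(StandardScaler(),
                    GridSearchCV(LogisticRegression(),
                                 # the search wraps the bare estimator, so no step prefix
                                 param_grid={'C': [0.1, 10.]},
                                 cv=2,
                                 refit=True))

X = np.random.rand(10, 3)
y = np.array([0, 1] * 5)  # both classes present in every fold
clf.fit(X, y)
clf.predict(X)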

This calls StandardScaler only once per call to clf.fit(), instead of the multiple calls you described.
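Once the search sits inside the pipeline, its results have to be read from the pipeline step rather than from the top-level object. A minimal sketch, assuming the default step name `gridsearchcv` that `make_pipeline` derives from the class name, and using fixed toy labels so that both folds contain both classes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(10, 3)
y = np.array([0, 1] * 5)

clf = make_pipeline(StandardScaler(),
                    GridSearchCV(LogisticRegression(),
                                 param_grid={"C": [0.1, 10.0]},
                                 cv=2,
                                 refit=True))
clf.fit(X, y)

# make_pipeline names each step after its lowercased class name.
search = clf.named_steps["gridsearchcv"]
print(search.best_params_)                      # the chosen value of C
print(search.cv_results_["mean_test_score"])    # one score per candidate

# The scaler was fitted on every row of X, including rows that served
# as validation folds inside the search (the leakage noted in the update).
print(clf.named_steps["standardscaler"].mean_)
```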

I changed refit to True because GridSearchCV is now used inside a pipeline. As mentioned in the documentation:

refit : boolean, default=True
Refit the best estimator with the entire dataset. If "False", it is impossible to make predictions using this GridSearchCV instance after fitting.

If refit=False, clf.fit() still runs the search and records the scores, but the GridSearchCV object inside the pipeline is left without a fitted best estimator, so it cannot predict. When refit=True, GridSearchCV is refitted with the best-scoring parameter combination on the whole data passed to fit().

So refit=False is appropriate only if you build the pipeline just to see the scores of the grid search. If you want to call clf.predict(), refit=True must be used; otherwise a "not fitted" error is raised.
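For readers who want to avoid the leakage described in the update, one alternative that is not part of the original answer is the `memory` parameter of scikit-learn pipelines, which caches fitted transformers on disk. The sketch below keeps the scaler inside the cross-validated pipeline (so it is fitted on training folds only) and assumes a temporary directory is an acceptable cache location. Caching adds hashing overhead, so it is likely to pay off only when the transformer is genuinely expensive.

```python
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(10, 3)
y = np.array([0, 1] * 5)  # both classes appear in each fold

with tempfile.TemporaryDirectory() as cache_dir:
    # memory=... caches each fitted transformer, keyed on its parameters
    # and input data, so a fold scaled for one value of C can be reused
    # for the next value instead of being refitted.
    pipe = make_pipeline(StandardScaler(), LogisticRegression(),
                         memory=cache_dir)
    grid = GridSearchCV(pipe,
                        param_grid={"logisticregression__C": [0.1, 10.0]},
                        cv=2)
    grid.fit(X, y)
    best_C = grid.best_params_["logisticregression__C"]
    preds = grid.predict(X)

print(best_C, preds)
```

Here the parameter name keeps its `logisticregression__` prefix because the grid search wraps the whole pipeline again.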
