Use sklearn's GridSearchCV with a pipeline, preprocessing just once


Question

I'm using scikit-learn to tune a model's hyperparameters. I'm using a pipeline to chain the preprocessing with the estimator. A simple version of my problem looks like this:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression


grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression()),
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)

_ = grid.fit(X=np.random.rand(10, 3),
             y=np.random.randint(2, size=(10,)))

In my case the preprocessing (StandardScaler() in the toy example) is time-consuming, and I'm not tuning any of its parameters.

So, when I execute the example, the StandardScaler is executed 12 times: 2 (fit/predict) * 2 (cv splits) * 3 (parameter values). (With only the two values of C listed in the toy grid above, the same arithmetic gives 2 * 2 * 2 = 8.) But every time StandardScaler is executed for a different value of the parameter C on the same split, it returns the same output, so it would be much more efficient to compute it once and then run only the estimator part of the pipeline.
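The repetition can be observed directly with a small sketch. CountingScaler below is a hypothetical helper, not part of scikit-learn, and the fixed balanced labels are an assumption made so that every fold contains both classes. It counts only calls to fit(); the scaler is additionally applied once more per scoring pass.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class CountingScaler(StandardScaler):
    """Hypothetical helper: a StandardScaler that counts calls to fit()."""

    fit_calls = 0  # class-level, so the count survives GridSearchCV's cloning

    def fit(self, X, y=None, sample_weight=None):
        CountingScaler.fit_calls += 1
        return super().fit(X, y, sample_weight)


rng = np.random.default_rng(0)
X = rng.random((10, 3))
y = np.array([0, 1] * 5)  # balanced labels so every fold contains both classes

grid = GridSearchCV(make_pipeline(CountingScaler(), LogisticRegression()),
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)
grid.fit(X, y)

# One fit per (split, C value) pair: 2 splits * 2 values of C.
print(CountingScaler.fit_calls)
```

The scaler is refit for every value of C even though, within a given split, its input never changes.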

I can manually split the pipeline into the preprocessing (no hyperparameters tuned) and the estimator. But to apply the preprocessing correctly, I should fit it on the training set only. So I would have to implement the splits manually and not use GridSearchCV at all.

Is there a simple/standard way to avoid repeating the preprocessing while using GridSearchCV?

Recommended answer

Update: Ideally, the answer below should not be used, because it leads to data leakage, as discussed in the comments. In this answer, GridSearchCV tunes the hyperparameters on data that StandardScaler has already preprocessed using the whole training set, so each validation fold has influenced the scaling statistics, which is not correct. In most cases that should not matter much, but algorithms that are very sensitive to scaling can give misleading results.
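To make the leakage concrete, here is a minimal sketch with made-up data (the values 0 to 9 and the choice of the first five rows as the training fold are assumptions for illustration). It compares the mean the scaler learns from all rows with the mean it would learn from a training fold alone.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature column: 0.0, 1.0, ..., 9.0
X = np.arange(10, dtype=float).reshape(-1, 1)
train_fold = X[:5]  # pretend the last five rows are the validation fold

# Scaler fitted before splitting: statistics include the validation rows.
mean_all = StandardScaler().fit(X).mean_[0]

# Scaler fitted inside the split: statistics come from training rows only.
mean_train = StandardScaler().fit(train_fold).mean_[0]

print(mean_all, mean_train)  # 4.5 2.0
```

When the scaler is fitted outside the search, the validation rows are centered with a statistic they helped compute, which is the leakage the update warns about.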

Essentially, GridSearchCV is itself an estimator: it implements the fit() and predict() methods that the pipeline uses.

So instead of:

grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression()),
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)

Do this:

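clf = make_pipeline(StandardScaler(),
                    GridSearchCV(LogisticRegression(),
                                 # the search wraps the bare estimator,
                                 # so the parameter name has no step prefix
                                 param_grid={'C': [0.1, 10.]},
                                 cv=2,
                                 refit=True))

# X, y: the same training data you would pass to grid.fit()
clf.fit(X, y)
clf.predict(X)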

This calls StandardScaler only once per call to clf.fit(), instead of the multiple calls you described.
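A runnable sketch of this arrangement is below. CountingScaler is a hypothetical helper (not part of scikit-learn) used only to confirm the single call, and the random data with balanced labels is an assumption. Note that the grid key is 'C' here, because the search wraps the bare LogisticRegression and not a pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class CountingScaler(StandardScaler):
    """Hypothetical helper: a StandardScaler that counts calls to fit()."""

    fit_calls = 0

    def fit(self, X, y=None, sample_weight=None):
        CountingScaler.fit_calls += 1
        return super().fit(X, y, sample_weight)


rng = np.random.default_rng(0)
X = rng.random((10, 3))
y = np.array([0, 1] * 5)  # balanced labels so every fold contains both classes

# The scaler runs first on all of X; the grid search then sees scaled data.
clf = make_pipeline(CountingScaler(),
                    GridSearchCV(LogisticRegression(),
                                 param_grid={'C': [0.1, 10.]},
                                 cv=2,
                                 refit=True))
clf.fit(X, y)
predictions = clf.predict(X[:3])

print(CountingScaler.fit_calls)  # the scaler was fitted a single time
```

The scaler here is fitted on every row before the search splits the data, so the leakage caveat from the update applies to this sketch as well.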

Note that refit is changed to True when GridSearchCV is used inside a pipeline. As mentioned in the documentation:

refit : boolean, default=True. Refit the best estimator with the entire dataset. If "False", it is impossible to make predictions using this GridSearchCV instance after fitting.

If refit=False, clf.fit() still runs the search, but the GridSearchCV object inside the pipeline keeps no fitted best estimator afterwards, so the pipeline cannot be used for prediction. When refit=True, GridSearchCV refits the best-scoring parameter combination on the whole data that is passed to fit().
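The difference can be checked by looking for the best_estimator_ attribute after fitting. This is a minimal sketch on made-up data; the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.random((10, 3))
y = np.array([0, 1] * 5)  # balanced labels so every fold contains both classes

param_grid = {'C': [0.1, 10.]}

no_refit = GridSearchCV(LogisticRegression(), param_grid, cv=2, refit=False).fit(X, y)
with_refit = GridSearchCV(LogisticRegression(), param_grid, cv=2, refit=True).fit(X, y)

# Scores are recorded in both cases ...
n_scores = len(no_refit.cv_results_["mean_test_score"])

# ... but only refit=True leaves a fitted model behind for predict().
has_model_no_refit = hasattr(no_refit, "best_estimator_")
has_model_with_refit = hasattr(with_refit, "best_estimator_")

print(n_scores, has_model_no_refit, has_model_with_refit)
```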

So refit=False is appropriate only if you are building the pipeline just to see the scores of the grid search. If you want to call clf.predict(), refit=True must be used; otherwise a "not fitted" error is raised.
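For readers who want to keep the scaler inside the cross-validated pipeline and so avoid the leakage mentioned in the update, one possible alternative is the pipeline's memory argument, which caches fitted transformers on disk. This is not part of the answer above. The sketch below assumes a temporary directory as the cache location. With caching, the scaler should be fitted once per split and reused across values of C; it is still applied to the validation data on every scoring pass.

```python
import shutil
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.random((10, 3))
y = np.array([0, 1] * 5)  # balanced labels so every fold contains both classes

cache_dir = tempfile.mkdtemp()  # assumed cache location for this sketch
try:
    # memory=... makes the pipeline cache each fitted transformer, keyed on
    # the transformer's parameters and the training data it was given.
    pipe = make_pipeline(StandardScaler(), LogisticRegression(), memory=cache_dir)
    grid = GridSearchCV(pipe,
                        param_grid={'logisticregression__C': [0.1, 10.]},
                        cv=2,
                        refit=False)
    grid.fit(X, y)
    n_candidates = len(grid.cv_results_["params"])
finally:
    shutil.rmtree(cache_dir, ignore_errors=True)  # remove the on-disk cache
```

Caching has its own overhead (hashing the data and reading from disk), so it tends to pay off only when the preprocessing is expensive, as in the question.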
