Scikit - Combining scale and grid search


Question

I am new to scikit-learn and have two small issues with combining data scaling and grid search.


  1. Effective scaler

Consider a cross-validation using KFold. Each time the model is trained on K-1 folds, I would like the data scaler (preprocessing.StandardScaler(), for instance) to be fit only on those K-1 folds and then applied to the remaining fold.

My impression is that the following code fits the scaler on the entire dataset, so I would like to modify it to behave as described above:

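from sklearn import preprocessing, svm
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

classifier = svm.SVC(C=1)
clf = make_pipeline(preprocessing.StandardScaler(), classifier)
tuned_parameters = [{'C': [1, 10, 100, 1000]}]
my_grid_search = GridSearchCV(clf, tuned_parameters, cv=5)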




  2. Retrieving the inner scaler fit

When refit=True, the model is refit on the entire dataset (using the best estimator) after the grid search. My understanding is that the pipeline is used again, so the scaler is fit on the entire dataset. Ideally I would like to reuse that fit to scale my test dataset. Is there a way to retrieve it directly from the GridSearchCV?

Recommended answer


  1. GridSearchCV knows nothing about the Pipeline object; it treats the provided estimator as atomic, in the sense that it cannot pick out one particular stage (StandardScaler, for example) or fit different stages on different data. All GridSearchCV does is call the fit(X, y) method of the provided estimator, where X, y are some split of the data. It therefore fits every stage on the same split, so the scaler is fit only on the K-1 training folds and then applied to the held-out fold.
  2. Try this:

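# best_estimator_ is the pipeline refit on the whole dataset (requires refit=True).
best_pipeline = my_grid_search.best_estimator_
# make_pipeline names each step after its lowercased class name.
best_scaler = best_pipeline.named_steps['standardscaler']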
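As a rough sketch of how the retrieval fits together end to end, the following assumes a synthetic dataset from make_classification and a hold-out split from train_test_split (neither appears in the question), and uses the prefixed parameter name 'svc__C'. Since refit=True is the default, best_estimator_ is the whole pipeline refit on all of the training data, so its scaler carries the training-set statistics.

```python
import numpy as np
from sklearn import preprocessing, svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic data, used only for illustration (an assumption, not from the question).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC())
my_grid_search = GridSearchCV(clf, [{'svc__C': [1, 10, 100]}], cv=5)
my_grid_search.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name.
best_pipeline = my_grid_search.best_estimator_
best_scaler = best_pipeline.named_steps['standardscaler']

# The refit scaler holds the statistics of the full training set.
scaler_matches_train = np.allclose(best_scaler.mean_, X_train.mean(axis=0))

# Scaling the test set by hand and calling the SVC directly gives the same
# predictions as calling predict on the grid search, which scales internally.
X_test_scaled = best_scaler.transform(X_test)
manual_pred = best_pipeline.named_steps['svc'].predict(X_test_scaled)
auto_pred = my_grid_search.predict(X_test)
same_predictions = np.array_equal(manual_pred, auto_pred)
```

In practice there is usually no need to extract the scaler at all: calling my_grid_search.predict(X_test) on unscaled test data runs the full refit pipeline, scaler included.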

When you wrap your transformers and estimators in a Pipeline, you have to prefix the name of each parameter with the name of its step, e.g. tuned_parameters = [{'svc__C': [1, 10, 100, 1000]}]. For more details, see the scikit-learn examples "Concatenating multiple feature extraction methods" and "Pipelining: chaining a PCA and a logistic regression".
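One way to find the right prefixed names is to list the pipeline's parameters with get_params(). A minimal sketch, assuming the same make_pipeline setup as in the question:

```python
from sklearn import preprocessing, svm
from sklearn.pipeline import make_pipeline

clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))

# Each step's parameters appear as '<step name>__<parameter name>'.
param_names = sorted(clf.get_params().keys())
svc_params = [name for name in param_names if name.startswith('svc__')]

# 'svc__C' is a valid grid key for this pipeline; a bare 'C' is not.
has_prefixed_c = 'svc__C' in param_names
has_bare_c = 'C' in param_names
```

Printing svc_params shows every SVC setting that can be placed in the grid for this pipeline.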

In any case, the GridSearchCV documentation is worth reading and may help.
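To check the first point of the answer, the sketch below compares cross-validating the pipeline with a hand-written loop that fits the scaler on the training folds only. It uses cross_val_score as a stand-in for the per-split fitting that GridSearchCV performs, and a synthetic dataset from make_classification; both choices are assumptions made for illustration.

```python
import numpy as np
from sklearn import preprocessing, svm
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data, used only for illustration (an assumption, not from the question).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
cv = KFold(n_splits=5)

# Cross-validation of the whole pipeline: fit(X_train, y_train) per split.
auto_scores = cross_val_score(clf, X, y, cv=cv)

# The same procedure written out by hand: the scaler sees only the
# training folds and is then applied to the held-out fold.
manual_scores = []
for train_idx, test_idx in cv.split(X):
    scaler = preprocessing.StandardScaler().fit(X[train_idx])
    model = svm.SVC(C=1).fit(scaler.transform(X[train_idx]), y[train_idx])
    manual_scores.append(model.score(scaler.transform(X[test_idx]), y[test_idx]))

scores_match = np.allclose(auto_scores, manual_scores)
```

If the scaler were fit on the entire dataset instead, the two sets of scores would not be guaranteed to agree.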
