How to apply StandardScaler in Pipeline in scikit-learn (sklearn)?


Problem description

In the example below,

pipe = Pipeline([
        ('scale', StandardScaler()),
        ('reduce_dims', PCA(n_components=4)),
        ('clf', SVC(kernel = 'linear', C = 1))])

param_grid = dict(reduce_dims__n_components=[4,6,8],
                  clf__C=np.logspace(-4, 1, 6),
                  clf__kernel=['rbf','linear'])

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2)
grid.fit(X_train, y_train)
print(grid.score(X_test, y_test))

I am using StandardScaler(). Is this the correct way to apply it to the test set as well?

Recommended answer

Yes, this is the right way to do it, but there is a small mistake in your code. Let me break it down for you.

When you use StandardScaler as a step inside a Pipeline, scikit-learn does the scaling for you internally.

What happens can be described as follows:

  • Step 0: The data are split into TRAINING data and TEST data according to the cv parameter that you specified in GridSearchCV. Steps 1 to 5 are repeated for each split.
  • Step 1: The scaler is fitted on the TRAINING data.
  • Step 2: The scaler transforms the TRAINING data.
  • Step 3: The models are fitted/trained using the transformed TRAINING data.
  • Step 4: The scaler, already fitted on the TRAINING data, is used to transform the TEST data.
  • Step 5: The trained models predict using the transformed TEST data.
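
Steps 1 to 5 can be sketched by hand for a single split and compared with what a Pipeline does. This is only a minimal illustration under assumptions that are not in the original question: the data are synthetic (make_classification), a single train_test_split stands in for one cross-validation split, and the PCA step is left out to keep the sketch short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, used only for illustration (an assumption, not the asker's data).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
# One split standing in for a single cross-validation split (step 0).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Manual version of steps 1 to 5.
scaler = StandardScaler().fit(X_tr)                      # step 1: fit on TRAINING data only
X_tr_scaled = scaler.transform(X_tr)                     # step 2: transform TRAINING data
clf = SVC(kernel='linear', C=1).fit(X_tr_scaled, y_tr)   # step 3: train the model
X_te_scaled = scaler.transform(X_te)                     # step 4: transform TEST data
manual_pred = clf.predict(X_te_scaled)                   # step 5: predict

# The same steps, delegated to a Pipeline.
pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', SVC(kernel='linear', C=1))])
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_te)

# Compare the two sets of predictions.
same_predictions = np.array_equal(manual_pred, pipe_pred)
print(same_predictions)
```

The point of the comparison is that the scaler never sees the TEST data during fitting, whether the steps are written out by hand or handled by the Pipeline.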

Note: You should be using grid.fit(X, y) and NOT grid.fit(X_train, y_train), because GridSearchCV automatically splits the data into training and testing data (this happens internally).
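
To make the internal split more concrete, here is a minimal sketch of what cv=3 amounts to for a classifier. It assumes synthetic data from make_classification and uses StratifiedKFold directly, which is the splitter scikit-learn selects when cv is an integer and the estimator is a classifier. GridSearchCV itself is not called here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Synthetic data, used only for illustration (an assumption).
X, y = make_classification(n_samples=90, n_features=10, random_state=0)

# Three folds, as with cv=3 in the example.
cv = StratifiedKFold(n_splits=3)

fold_sizes = []
for train_idx, test_idx in cv.split(X, y):
    # Each split uses two thirds of the rows for training and one third for testing.
    fold_sizes.append((len(train_idx), len(test_idx)))
    print(len(train_idx), len(test_idx))
```

Every row ends up in the testing part of exactly one split, so no manual train/test split is needed for the search itself.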

Use it as follows:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA

# Scaling, dimensionality reduction and classification chained in one estimator.
pipe = Pipeline([
        ('scale', StandardScaler()),
        ('reduce_dims', PCA(n_components=4)),
        ('clf', SVC(kernel='linear', C=1))])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = dict(reduce_dims__n_components=[4, 6, 8],
                  clf__C=np.logspace(-4, 1, 6),
                  clf__kernel=['rbf', 'linear'])

# X and y are the full feature matrix and labels; GridSearchCV splits them internally.
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring='accuracy')
grid.fit(X, y)
print(grid.best_score_)
print(grid.cv_results_)


Once you run this code (when you call grid.fit(X, y)), you can access the outcome of the grid search on the fitted grid object, which is also what grid.fit() returns. The best_score_ attribute gives the best mean cross-validated score observed during the search, and best_params_ describes the combination of parameters that achieved it.
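
As a minimal sketch of reading these results, the pipeline and parameter grid from the answer are run below on synthetic data (make_classification is an assumption made only so the example is self-contained). It also looks at best_estimator_, which is available because refit=True is the default.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, used only for illustration (an assumption).
X, y = make_classification(n_samples=150, n_features=10, random_state=0)

pipe = Pipeline([
        ('scale', StandardScaler()),
        ('reduce_dims', PCA(n_components=4)),
        ('clf', SVC(kernel='linear', C=1))])

param_grid = dict(reduce_dims__n_components=[4, 6, 8],
                  clf__C=np.logspace(-4, 1, 6),
                  clf__kernel=['rbf', 'linear'])

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring='accuracy')
fitted = grid.fit(X, y)

print(fitted is grid)        # fit() returns the same GridSearchCV object
print(grid.best_params_)     # best parameter combination found
print(grid.best_score_)      # mean cross-validated accuracy of that combination

# The refitted pipeline: its scaler was fitted on all rows passed to grid.fit().
best_scaler = grid.best_estimator_.named_steps['scale']
print(best_scaler.mean_.shape)
```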

IMPORTANT EDIT 1: If you want to hold out a validation set from the original dataset, use this:

from sklearn.model_selection import train_test_split

X_for_gridsearch, X_future_validation, y_for_gridsearch, y_future_validation = \
    train_test_split(X, y, test_size=0.15, random_state=1)

Then use:

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring= 'accuracy')
grid.fit(X_for_gridsearch, y_for_gridsearch)
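
One possible way to use the held-out part afterwards is to score the refitted best pipeline on it. The sketch below is an illustration under assumptions: the data are synthetic, the pipeline is shortened to the scaler and the classifier, and the parameter grid is reduced to three values of C.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, used only for illustration (an assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hold out 15% of the rows; the grid search never sees them.
X_for_gridsearch, X_future_validation, y_for_gridsearch, y_future_validation = \
    train_test_split(X, y, test_size=0.15, random_state=1)

pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', SVC(kernel='linear'))])

# Reduced grid (an assumption, chosen to keep the sketch fast).
grid = GridSearchCV(pipe, param_grid={'clf__C': [0.1, 1, 10]}, cv=3, scoring='accuracy')
grid.fit(X_for_gridsearch, y_for_gridsearch)

# score() applies the refitted best pipeline, so the held-out rows are scaled
# with the statistics learned from X_for_gridsearch only.
validation_accuracy = grid.score(X_future_validation, y_future_validation)
print(validation_accuracy)
```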
