How to apply StandardScaler in Pipeline in scikit-learn (sklearn)?
Question
In the following example,
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('reduce_dims', PCA(n_components=4)),
    ('clf', SVC(kernel='linear', C=1))])

param_grid = dict(reduce_dims__n_components=[4, 6, 8],
                  clf__C=np.logspace(-4, 1, 6),
                  clf__kernel=['rbf', 'linear'])

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2)
grid.fit(X_train, y_train)
print(grid.score(X_test, y_test))
I am using StandardScaler(). Is this the correct way to apply it to the test set as well?
Answer
Yes, this is the right way to do it, but there is a small mistake in your code. Let me break it down for you.

When you use the StandardScaler as a step inside a Pipeline, scikit-learn will internally do the job for you. What happens can be described as follows:
- Step 0: The data are split into TRAINING data and TEST data according to the cv parameter that you specified in the GridSearchCV.
- Step 1: The scaler is fitted on the TRAINING data.
- Step 2: The scaler transforms the TRAINING data.
- Step 3: The models are fitted/trained using the transformed TRAINING data.
- Step 4: The scaler is used to transform the TEST data.
- Step 5: The trained models predict using the transformed TEST data.
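The steps above can be sketched by hand for a single CV fold. This is a minimal illustration of what the Pipeline does internally, using hypothetical toy data (your real folds would come from GridSearchCV's splitter):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for one cross-validation fold.
rng = np.random.RandomState(0)
X_fold_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # TRAINING fold
X_fold_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))    # TEST fold

scaler = StandardScaler()
scaler.fit(X_fold_train)                         # Step 1: fit on TRAINING data only
X_train_scaled = scaler.transform(X_fold_train)  # Step 2: transform TRAINING data
X_test_scaled = scaler.transform(X_fold_test)    # Step 4: reuse the SAME fit on TEST data

# The TRAINING fold is centered exactly; the TEST fold only approximately,
# because it was scaled with statistics learned from the TRAINING fold.
print(np.allclose(X_train_scaled.mean(axis=0), 0.0))  # True
```

The key point is that `fit` is never called on the TEST fold, so no information leaks from it into the scaling statistics.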
Note: You should be using grid.fit(X, y) and NOT grid.fit(X_train, y_train), because GridSearchCV will automatically split the data into training and testing data (this happens internally).
Use it as follows:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('reduce_dims', PCA(n_components=4)),
    ('clf', SVC(kernel='linear', C=1))])

param_grid = dict(reduce_dims__n_components=[4, 6, 8],
                  clf__C=np.logspace(-4, 1, 6),
                  clf__kernel=['rbf', 'linear'])

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring='accuracy')
grid.fit(X, y)
print(grid.best_score_)
print(grid.cv_results_)
Once you run this code (when you call grid.fit(X, y)), you can access the outcome of the grid search in the result object returned from grid.fit(). The best_score_ member provides access to the best score observed during the optimization procedure, and best_params_ describes the combination of parameters that achieved the best results.
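A runnable miniature of this grid search, using the iris dataset as stand-in data (the original X and y are not shown in the question), demonstrates how to read those attributes after fitting:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Iris is only a stand-in for the question's unspecified X, y.
X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('reduce_dims', PCA(n_components=2)),
    ('clf', SVC(kernel='linear', C=1))])

param_grid = dict(reduce_dims__n_components=[2, 3],
                  clf__C=np.logspace(-2, 1, 4),
                  clf__kernel=['rbf', 'linear'])

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring='accuracy')
grid.fit(X, y)

print(grid.best_score_)   # best mean cross-validated accuracy
print(grid.best_params_)  # the parameter combination that achieved it
# grid.best_estimator_ is the pipeline refitted on all of X, y
# with the winning parameters (because refit=True by default).
```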
IMPORTANT EDIT 1: if you want to keep a validation dataset out of the original dataset, use this:
from sklearn.model_selection import train_test_split

X_for_gridsearch, X_future_validation, y_for_gridsearch, y_future_validation = \
    train_test_split(X, y, test_size=0.15, random_state=1)
Then use:
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring='accuracy')
grid.fit(X_for_gridsearch, y_for_gridsearch)
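The held-out split can then be scored with the refitted best pipeline. Here is a runnable sketch of that workflow, again with iris as stand-in data (the original dataset is not shown in the question):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)  # stand-in for the question's X, y

# Hold out 15% as a final validation set the grid search never sees.
X_for_gridsearch, X_future_validation, y_for_gridsearch, y_future_validation = \
    train_test_split(X, y, test_size=0.15, random_state=1)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('reduce_dims', PCA(n_components=2)),
    ('clf', SVC(kernel='linear', C=1))])
param_grid = dict(reduce_dims__n_components=[2, 3],
                  clf__C=np.logspace(-2, 1, 4))

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring='accuracy')
grid.fit(X_for_gridsearch, y_for_gridsearch)

# grid.score uses the best pipeline refitted on all grid-search data, so the
# validation set is scaled with the training statistics automatically.
print(grid.score(X_future_validation, y_future_validation))
```

Because the scaler lives inside the Pipeline, no manual transform of X_future_validation is needed before scoring.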