How to apply StandardScaler in Pipeline in scikit-learn (sklearn)?
Question
In the following example,
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('reduce_dims', PCA(n_components=4)),
    ('clf', SVC(kernel='linear', C=1))])

param_grid = dict(reduce_dims__n_components=[4, 6, 8],
                  clf__C=np.logspace(-4, 1, 6),
                  clf__kernel=['rbf', 'linear'])

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2)
grid.fit(X_train, y_train)
print(grid.score(X_test, y_test))
I am using StandardScaler(). Is this the correct way to apply it to the test set as well?
Answer
Yes, this is the right way to do it, but there is a small mistake in your code. Let me break it down for you.

When you use the StandardScaler as a step inside a Pipeline, scikit-learn will internally do the job for you. What happens can be described as follows:
- Step 0: The data are split into TRAINING data and TEST data according to the cv parameter that you specified in the GridSearchCV.
- Step 1: The scaler is fitted on the TRAINING data.
- Step 2: The scaler transforms the TRAINING data.
- Step 3: The models are fitted/trained using the transformed TRAINING data.
- Step 4: The scaler is used to transform the TEST data.
- Step 5: The trained models predict using the transformed TEST data.
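The steps above can be sketched by hand for a single CV fold. This is a minimal illustration of what the Pipeline does internally, using hypothetical toy data (your real folds would come from GridSearchCV's splitter):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for one cross-validation fold.
rng = np.random.RandomState(0)
X_fold_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # TRAINING fold
X_fold_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))    # TEST fold

scaler = StandardScaler()
scaler.fit(X_fold_train)                         # Step 1: fit on TRAINING data only
X_train_scaled = scaler.transform(X_fold_train)  # Step 2: transform TRAINING data
X_test_scaled = scaler.transform(X_fold_test)    # Step 4: reuse the SAME fit on TEST data

# The TRAINING fold is centered exactly; the TEST fold only approximately,
# because it was scaled with statistics learned from the TRAINING fold.
print(np.allclose(X_train_scaled.mean(axis=0), 0.0))  # True
```

The key point is that `fit` is never called on the TEST fold, so no information leaks from it into the scaling statistics.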
Note: You should be using grid.fit(X, y) and NOT grid.fit(X_train, y_train), because GridSearchCV will automatically split the data into training and testing data (this happens internally).
Use it as follows:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('reduce_dims', PCA(n_components=4)),
    ('clf', SVC(kernel='linear', C=1))])

param_grid = dict(reduce_dims__n_components=[4, 6, 8],
                  clf__C=np.logspace(-4, 1, 6),
                  clf__kernel=['rbf', 'linear'])

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring='accuracy')
grid.fit(X, y)
print(grid.best_score_)
print(grid.cv_results_)
Once you run this code (when you call grid.fit(X, y)), you can access the outcome of the grid search in the result object returned from grid.fit(). The best_score_ member provides access to the best score observed during the optimization procedure, and best_params_ describes the combination of parameters that achieved the best results.
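A runnable miniature of this grid search, using the iris dataset as stand-in data (the original X and y are not shown in the question), demonstrates how to read those attributes after fitting:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Iris is only a stand-in for the question's unspecified X, y.
X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('reduce_dims', PCA(n_components=2)),
    ('clf', SVC(kernel='linear', C=1))])

param_grid = dict(reduce_dims__n_components=[2, 3],
                  clf__C=np.logspace(-2, 1, 4),
                  clf__kernel=['rbf', 'linear'])

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring='accuracy')
grid.fit(X, y)

print(grid.best_score_)   # best mean cross-validated accuracy
print(grid.best_params_)  # the parameter combination that achieved it
# grid.best_estimator_ is the pipeline refitted on all of X, y
# with the winning parameters (because refit=True by default).
```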
IMPORTANT EDIT 1: if you want to keep a validation dataset out of the original dataset, use this:
from sklearn.model_selection import train_test_split

X_for_gridsearch, X_future_validation, y_for_gridsearch, y_future_validation = \
    train_test_split(X, y, test_size=0.15, random_state=1)
Then use:
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring='accuracy')
grid.fit(X_for_gridsearch, y_for_gridsearch)
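The held-out split can then be scored with the refitted best pipeline. Here is a runnable sketch of that workflow, again with iris as stand-in data (the original dataset is not shown in the question):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)  # stand-in for the question's X, y

# Hold out 15% as a final validation set the grid search never sees.
X_for_gridsearch, X_future_validation, y_for_gridsearch, y_future_validation = \
    train_test_split(X, y, test_size=0.15, random_state=1)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('reduce_dims', PCA(n_components=2)),
    ('clf', SVC(kernel='linear', C=1))])
param_grid = dict(reduce_dims__n_components=[2, 3],
                  clf__C=np.logspace(-2, 1, 4))

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring='accuracy')
grid.fit(X_for_gridsearch, y_for_gridsearch)

# grid.score uses the best pipeline refitted on all grid-search data, so the
# validation set is scaled with the training statistics automatically.
print(grid.score(X_future_validation, y_future_validation))
```

Because the scaler lives inside the Pipeline, no manual transform of X_future_validation is needed before scoring.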