Keras Regression using Scikit Learn StandardScaler with Pipeline and without Pipeline


Problem Description

I am comparing the performance of two programs about KerasRegressor using Scikit-Learn StandardScaler: one program with Scikit-Learn Pipeline and one program without the Pipeline.

Program 1:

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from keras.wrappers.scikit_learn import KerasRegressor

estimators = []
estimators.append(('standardise', StandardScaler()))
estimators.append(('multiLayerPerceptron', KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)))
pipeline = Pipeline(estimators)
log = pipeline.fit(X_train, Y_train)
Y_deep = pipeline.predict(X_test)

Program 2:

from sklearn.preprocessing import StandardScaler
from keras.wrappers.scikit_learn import KerasRegressor

scale = StandardScaler()
X_train = scale.fit_transform(X_train)
X_test = scale.fit_transform(X_test)
model_np = KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)
log = model_np.fit(X_train, Y_train)
Y_deep = model_np.predict(X_test)

My problem is that Program 1 achieves an R2 score of 0.98 (averaged over 3 trials), while Program 2 only achieves an R2 score of 0.84 (averaged over 3 trials). Can anyone explain the difference between these two programs?

Answer

In the second case, you are calling StandardScaler.fit_transform() on both X_train and X_test, which is incorrect usage.

You should call fit_transform() on X_train and then call only transform() on X_test, because that is what the Pipeline does. As the documentation states, the Pipeline will:

fit():

Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator

predict():

Apply transforms to the data, and predict with the final estimator

So you see, it will only apply transform() to the test data, not fit_transform().
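
The effect of reusing the training statistics can be seen without Keras or scikit-learn at all. In the pure-Python sketch below, fit() and transform() are hypothetical stand-ins for what StandardScaler does internally; the numbers are made up for illustration:

```python
# A pure-Python illustration of why the Pipeline applies only transform()
# to the test data. fit() learns statistics; transform() applies them.

def fit(data):
    """Learn the mean and standard deviation, like StandardScaler.fit()."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n  # population variance, as sklearn uses
    return mean, var ** 0.5

def transform(data, mean, std):
    """Apply previously learned statistics, like StandardScaler.transform()."""
    return [(x - mean) / std for x in data]

X_train = [1.0, 2.0, 3.0, 4.0]
X_test = [3.0, 5.0]

# Correct: scale the test set with the statistics learned on the train set.
mean, std = fit(X_train)                 # mean = 2.5, std ~ 1.118
test_correct = transform(X_test, mean, std)

# Wrong: refitting on the test set learns different statistics, so the
# same raw value 3.0 lands at a different place on the scale.
mean_t, std_t = fit(X_test)
test_wrong = transform(X_test, mean_t, std_t)

print(test_correct)  # 3.0 maps to about 0.447
print(test_wrong)    # 3.0 maps to -1.0: a different scale entirely
```

The model is trained on values scaled like `test_correct`, so feeding it values scaled like `test_wrong` degrades predictions, which matches the R2 drop in the question.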

To elaborate my point, your code should be:

scale = StandardScaler()
X_train = scale.fit_transform(X_train)

# This is the change: transform (not fit_transform) the test set,
# reusing the mean and SD learned from the training set
X_test = scale.transform(X_test)

model_np = KerasRegressor(build_fn=build_nn, nb_epoch=num_epochs, batch_size=10, verbose=0)
log = model_np.fit(X_train, Y_train)
Y_deep = model_np.predict(X_test)

Calling fit() or fit_transform() on the test data wrongly scales it to a different scale than the one used on the train data, and that is the source of the difference in predictions.

Edit: To answer the question in the comments:

See, fit_transform() is just a shortcut for calling fit() and then transform(). For StandardScaler, fit() does not return any transformed data; it only learns the mean and standard deviation of the data. transform() then applies what was learned to return newly scaled data.
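
That relationship can be sketched in a few lines of plain Python. TinyScaler below is a hypothetical, stripped-down imitation of StandardScaler, just to show that fit_transform() is literally fit() followed by transform():

```python
# A minimal sketch of the StandardScaler idea, showing that fit_transform()
# is just fit() followed by transform() on the same data.

class TinyScaler:
    def fit(self, data):
        # Learn the statistics; nothing is transformed yet.
        self.mean = sum(data) / len(data)
        var = sum((x - self.mean) ** 2 for x in data) / len(data)
        self.std = var ** 0.5
        return self

    def transform(self, data):
        # Apply the learned statistics to produce scaled data.
        return [(x - self.mean) / self.std for x in data]

    def fit_transform(self, data):
        # Shortcut: learn and apply in one call.
        return self.fit(data).transform(data)

X = [2.0, 4.0, 6.0]
a = TinyScaler().fit_transform(X)
b = TinyScaler().fit(X).transform(X)
assert a == b  # identical results
```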

So what you are saying leads to the following two scenarios:

Scenario 1: Wrong

1) X_scaled = scaler.fit_transform(X)
2) Divide the X_scaled into X_scaled_train, X_scaled_test and run your model. 
   No need to scale again.

Scenario 2: Wrong (basically equal to Scenario 1, with the scaling and splitting operations reversed)

1) Divide the X into X_train, X_test
2) scale.fit_transform(X) [# You are not using the returned value, only fitting the data, so equivalent to scale.fit(X)]
3.a) X_train_scaled = scale.transform(X_train) #[Equals X_scaled_train in scenario 1]
3.b) X_test_scaled = scale.transform(X_test) #[Equals X_scaled_test in scenario 1]
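
The claimed equivalence of the two scenarios can be checked numerically. The sketch below again uses hypothetical fit()/transform() helpers in place of StandardScaler, with made-up data:

```python
# Check that Scenario 1 and Scenario 2 give identical results: fitting on
# the whole X and then transforming each part equals transforming the whole
# X and then splitting it, because both use the statistics of the full X.

def fit(data):
    mean = sum(data) / len(data)
    std = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5
    return mean, std

def transform(data, mean, std):
    return [(x - mean) / std for x in data]

X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Scenario 1: scale everything, then split.
m, s = fit(X)
X_scaled = transform(X, m, s)
s1_train, s1_test = X_scaled[:4], X_scaled[4:]

# Scenario 2: split first, but fit on the whole X, then transform each part.
X_train, X_test = X[:4], X[4:]
m2, s2 = fit(X)
s2_train = transform(X_train, m2, s2)
s2_test = transform(X_test, m2, s2)

assert s1_train == s2_train and s1_test == s2_test
```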

You can try either of these scenarios, and it may even improve the performance of your model.

But there is one very important thing missing in them. When you scale the whole data and then divide it into train and test, it is assumed that you know the test (unseen) data, which is not true in real-world cases, and it will give you results that do not reflect real-world performance, because in the real world all the data we have is training data. It may also lead to over-fitting, because the model already has some information about the test data.

So when evaluating the performance of machine learning models, it is recommended that you keep the test data aside before performing any operations on it. Because it is unseen data, we know nothing about it. The ideal path of operations is therefore the one in my answer, i.e.:

1) Divide X into X_train and X_test (same for y)
2) X_train_scaled = scale.fit_transform(X_train) [#Learn the mean and SD of train data]
3) X_test_scaled = scale.transform(X_test) [#Use the mean and SD learned in step2 to convert test data]
4) Use the X_train_scaled for training the model and X_test_scaled in evaluation.
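
The four steps above can be sketched end-to-end in plain Python. The scaler helpers are hypothetical stand-ins for StandardScaler, and the "model" is a trivial mean predictor, just to show where each array is used:

```python
# End-to-end sketch of the recommended path: split first, fit the scaler on
# the train split only, and reuse its statistics for the test split.

def fit_scaler(data):
    mean = sum(data) / len(data)
    std = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5
    return mean, std

def scale(data, mean, std):
    return [(x - mean) / std for x in data]

X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

# 1) Divide X into X_train and X_test (same for y).
X_train, X_test = X[:4], X[4:]
y_train, y_test = y[:4], y[4:]

# 2) Learn the mean and SD of the train data only.
mean, std = fit_scaler(X_train)
X_train_scaled = scale(X_train, mean, std)

# 3) Use the statistics learned in step 2 to convert the test data.
X_test_scaled = scale(X_test, mean, std)

# 4) Train on X_train_scaled, evaluate on X_test_scaled.
# (A trivial "model" that predicts the mean of y_train, for illustration only.)
prediction = sum(y_train) / len(y_train)
test_error = sum((yt - prediction) ** 2 for yt in y_test) / len(y_test)
```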

Hope this makes sense.
