Normalize data before or after split of training and testing data?


Question


I want to separate my data into train and test set, should I apply normalization over data before or after the split? Does it make any difference while building predictive model?

Solution

You first need to split the data into training and test set (validation set could be useful too).

Don't forget that testing data points represent real-world data. Feature normalization (or data standardization) of the explanatory (or predictor) variables is a technique used to center and scale the data by subtracting the mean and dividing by the standard deviation. If you take the mean and standard deviation of the whole dataset, you'll be introducing future information into the training explanatory variables (i.e. the mean and standard deviation).

Therefore, you should perform feature normalisation over the training data. Then perform normalisation on testing instances as well, but this time using the mean and variance of training explanatory variables. In this way, we can test and evaluate whether our model can generalize well to new, unseen data points.
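As a minimal sketch of that fit-on-train, apply-to-test pattern (the toy numbers below are made up purely for illustration), scikit-learn's `StandardScaler` implements exactly the subtract-the-mean, divide-by-the-standard-deviation transform described above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix, already split by hand into train and test portions
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

scaler = StandardScaler()

# fit_transform learns the per-feature mean and std from the TRAINING data only...
X_train_scaled = scaler.fit_transform(X_train)

# ...and transform applies those same training statistics to the test data,
# so no information from the test set leaks into the preprocessing step.
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_)     # per-feature means learned from the training data
print(X_test_scaled)    # test rows expressed in training-set units
```

Because `transform` reuses the statistics learned during `fit`, the test set is scaled with the training mean and standard deviation, which is precisely what prevents information leakage.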

For a more comprehensive treatment, you can read my article Feature Scaling and Normalisation in a nutshell


As an example, assuming we have the following data:

>>> import numpy as np
>>> 
>>> X, y = np.arange(10).reshape((5, 2)), range(5)

where X represents our features:

>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])

and y contains the corresponding labels:

>>> list(y)
[0, 1, 2, 3, 4]


Step 1: Create training/testing sets

>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>>
>>> X_test
array([[2, 3],
       [8, 9]])
>>>
>>> y_train
[2, 0, 3]
>>>
>>> y_test
[1, 4]

Step 2: Normalise training data

>>> from sklearn import preprocessing
>>> 
>>> normalizer = preprocessing.Normalizer()
>>> normalized_train_X = normalizer.fit_transform(X_train)
>>> normalized_train_X
array([[0.62469505, 0.78086881],
       [0.        , 1.        ],
       [0.65079137, 0.7592566 ]])

Step 3: Normalize testing data

>>> normalized_test_X = normalizer.transform(X_test)
>>> normalized_test_X
array([[0.5547002 , 0.83205029],
       [0.66436384, 0.74740932]])

