Normalize data before or after split of training and testing data?

Question

I want to split my data into a training set and a test set. Should I normalize the data before or after the split? Does the choice make any difference when building a predictive model? Thanks in advance.

Recommended answer

First split the data into training and test sets (a validation set may also be required).

Keep in mind that the test data points stand in for real-world data the model has not seen. Feature normalization (or data standardization) of the explanatory (predictor) variables centers and scales the data by subtracting the mean and dividing by the standard deviation. If you compute the mean and standard deviation over the whole dataset, you introduce future information into the training explanatory variables, because those statistics then depend on the test points.
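For reference, the usual z-score form of standardization can be written as follows. The symbols are our own notation rather than the original answer's, and note that the divisor is the standard deviation, the square root of the variance:

$$
z = \frac{x - \mu}{\sigma}, \qquad
\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2}
$$

The question then comes down to which $n$ points are used to estimate $\mu$ and $\sigma$: all of the data, or the training portion only.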

Therefore, perform feature normalization on the training data only. Then normalize the test instances as well, but using the mean and standard deviation computed from the training explanatory variables. This way, you can test and evaluate whether the model generalizes well to new, unseen data points.
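The procedure described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not code from the original answer. The helper names `fit_standardizer` and `standardize`, the synthetic single-feature data, and the 80/20 split are all assumptions made for the example.

```python
import random
import statistics


def fit_standardizer(values):
    """Estimate the mean and (population) standard deviation from `values` only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean, std


def standardize(values, mean, std):
    """Apply z = (x - mean) / std using statistics supplied by the caller."""
    return [(v - mean) / std for v in values]


# Synthetic single-feature dataset (an assumption for illustration).
random.seed(0)
data = [random.gauss(50.0, 10.0) for _ in range(100)]

# 1. Split first: 80 training points, 20 test points.
random.shuffle(data)
train, test = data[:80], data[80:]

# 2. Estimate the statistics from the training set only.
mean, std = fit_standardizer(train)

# 3. Apply the same training statistics to both sets.
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)

print(f"train mean after scaling: {statistics.fmean(train_scaled):.3f}")
print(f"test mean after scaling:  {statistics.fmean(test_scaled):.3f}")
```

The scaled training set has mean 0 and standard deviation 1 by construction, while the scaled test set generally does not, which is expected: the test points are treated as unseen data. If scikit-learn is available, the same pattern is typically expressed by calling `StandardScaler().fit(X_train)` and then `transform` on both `X_train` and `X_test`.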
