PCA first or normalization first?

Question

When doing regression or classification, what is the correct (or better) way to preprocess the data?

  1. Normalize the data -> PCA -> training
  2. PCA -> normalize the PCA output -> training
  3. Normalize the data -> PCA -> normalize the PCA output -> training

Which of the above is more correct, or is there a standard way to preprocess the data? By "normalize" I mean standardization, linear scaling, or some other technique.

Recommended answer

You should normalize the data before doing PCA. For example, consider the following situation. I create a data set X with a known correlation matrix C:

>> C = [1 0.5; 0.5 1];
>> A = chol(C);
>> X = randn(100,2) * A;
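
The Cholesky step works because chol returns an upper-triangular A with A'*A equal to C, so the rows of randn(n,2)*A have covariance A'*A = C. A minimal check of this, using base MATLAB only; the names Xbig and Chat and the larger sample size are illustrative assumptions, not part of the original answer:

```matlab
C = [1 0.5; 0.5 1];          % target correlation matrix
A = chol(C);                 % upper triangular factor, A'*A == C
disp(norm(A'*A - C));        % should be ~0 up to rounding

rng(0);                      % fix the seed so the run is repeatable
Xbig = randn(100000,2) * A;  % more samples than the original 100, to reduce noise
Chat = corrcoef(Xbig);       % sample correlation matrix
disp(Chat);                  % off-diagonal entries should be close to 0.5
```

With only 100 samples, as in the answer, the sample correlation will scatter more widely around 0.5.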

If I now perform PCA, I correctly find that the principal components (the rows of the weights matrix) are oriented at an angle to the coordinate axes:

>> wts=pca(X)
wts =
    0.6659    0.7461
   -0.7461    0.6659
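
For reference, a PCA of this kind can be sketched with eig on the sample covariance matrix. This is an illustration under assumptions, not the pca function used in this answer: the helper name simple_pca is hypothetical, it centers the data (cov does this), and it returns the components as rows to match the convention here. The sign of each component is arbitrary, so rows may differ from a given output by a factor of -1.

```matlab
function wts = simple_pca(X)
% SIMPLE_PCA  Hypothetical helper: principal components as rows of wts.
    S = cov(X);                             % sample covariance (centers the data)
    [V, D] = eig(S);                        % columns of V are eigenvectors of S
    [~, order] = sort(diag(D), 'descend');  % largest variance first
    wts = V(:, order)';                     % transpose so each row is one component
end
```

Saved as simple_pca.m, it would be called as wts = simple_pca(X).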

If I now scale the first feature of the data set by a factor of 100, intuition suggests that the principal components should not change:

>> Y = X;
>> Y(:,1) = 100 * Y(:,1);

However, we now find that the principal components are aligned with the coordinate axes:

>> wts=pca(Y)
wts =
    1.0000    0.0056
   -0.0056    1.0000
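
The reason shows up in the covariance matrix. Scaling the first feature by 100 multiplies its variance by 10^4 and its covariance with the second feature by 100, so the population covariance goes from [1 0.5; 0.5 1] to [10000 50; 50 1] and the first feature dominates. A small sketch using these population values rather than the random sample (the names S, covY and v1 are illustrative):

```matlab
C = [1 0.5; 0.5 1];      % population correlation of X
S = diag([100 1]);       % scale feature 1 by 100, leave feature 2 alone
covY = S * C * S;        % population covariance of Y: [10000 50; 50 1]

[V, D] = eig(covY);      % columns of V are eigenvectors
[~, k] = max(diag(D));   % index of the largest eigenvalue
v1 = V(:, k)' / V(1, k); % leading component, rescaled so its first entry is 1
disp(v1);                % second entry is about 0.005: almost parallel to axis 1
```

The second entry is of the same order as the 0.0056 in the sample output above; the difference comes from sampling noise in the 100 random points.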

To resolve this, there are two options. First, I could rescale the data:

>> Ynorm = bsxfun(@rdivide,Y,std(Y))

(The weird bsxfun notation is used to do vector-matrix arithmetic in Matlab - all I'm doing is dividing each feature by its standard deviation).
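
On MATLAB R2016b or later, implicit expansion makes bsxfun unnecessary for this kind of column-wise operation. A sketch assuming that release or newer; Yscaled and Ystd are illustrative names, and the Ystd line shows full standardization, which also subtracts the column means:

```matlab
rng(0);                                   % repeatable random data
Y = randn(100,2) * chol([1 0.5; 0.5 1]);  % correlated features, as in the answer
Y(:,1) = 100 * Y(:,1);                    % badly scaled first feature

Yscaled = Y ./ std(Y);                    % same as bsxfun(@rdivide, Y, std(Y))
Ystd = (Y - mean(Y)) ./ std(Y);           % also center each column (z-scores)

disp(std(Ystd));                          % each column has unit standard deviation
disp(mean(Ystd));                         % each column mean is ~0 up to rounding
```

Whether the centering matters depends on whether the PCA routine in use centers the data itself.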

We now get sensible results from PCA:

>> wts = pca(Ynorm)
wts =
   -0.7125   -0.7016
    0.7016   -0.7125

They're slightly different to the PCA on the original data because we've now guaranteed that our features have unit standard deviation, which wasn't the case originally.

The other option is to perform PCA using the correlation matrix of the data, instead of the outer product:

>> wts = pca(Y,'corr')
wts =
    0.7071    0.7071
   -0.7071    0.7071

In fact this is completely equivalent to standardizing the data by subtracting the mean and then dividing by the standard deviation. It's just more convenient. In my opinion you should always do this unless you have a good reason not to (e.g. if you want to pick up differences in the variation of each feature).
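
The equivalence can be checked numerically: the covariance matrix of standardized data is the correlation matrix of the original data. A sketch assuming R2016b or later for implicit expansion (Z is an illustrative name). For two features the correlation matrix has the form [1 r; r 1], whose eigenvectors are [1 1]/sqrt(2) and [-1 1]/sqrt(2) up to sign whenever r is nonzero, which is where the 0.7071 entries come from.

```matlab
rng(0);                                   % repeatable random data
Y = randn(100,2) * chol([1 0.5; 0.5 1]);  % correlated features
Y(:,1) = 100 * Y(:,1);                    % badly scaled first feature

Z = (Y - mean(Y)) ./ std(Y);              % standardize each column
disp(norm(cov(Z) - corrcoef(Y)));         % ~0: same matrix up to rounding

[V, ~] = eig(corrcoef(Y));                % PCA on the correlation matrix
disp(abs(V));                             % every entry is 0.7071 in magnitude
```

With more than two features the components are no longer fixed at 45 degrees, but the match between cov(Z) and corrcoef(Y) still holds.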
