Matlab: how to find which variables from dataset could be discarded using PCA in matlab?


Problem description

I am using PCA to find out which variables in my dataset are redundant because they are highly correlated with other variables. I am using the princomp MATLAB function on data previously normalized using zscore:

[coeff, PC, eigenvalues] = princomp(zscore(x))

I know that the eigenvalues tell me how much of the dataset's variance each principal component covers, and that coeff tells me how much of the i-th original variable is in the j-th principal component (where i indexes rows and j indexes columns).

So I assumed that, to find out which variables of the original dataset are the most and the least important, I should multiply the coeff matrix by the eigenvalues - the coeff values represent how much of every variable each component has, and the eigenvalues tell how important that component is. So this is my full code:

[coeff, PC, eigenvalues] = princomp(zscore(x));
e = eigenvalues./sum(eigenvalues);
abs(coeff)/e

But this does not really show anything - I tried it on the following set, where variable 1 is fully correlated with variable 2 (v2 = v1 + 2):

     v1    v2    v3
     1     3     4
     2     4    -1
     4     6     9
     3     5    -2

but the results of my calculations were the following:

v1 0.5525
v2 0.5525
v3 0.5264

and this does not really show anything. I would expect the result for variable 2 to show that it is far less important than v1 or v3. Which of my assumptions is wrong?

Recommended answer

EDIT I have completely reworked the answer now that I understand which assumptions were wrong.

Before explaining what doesn't work in the OP, let me make sure we're using the same terminology. In principal component analysis, the goal is to obtain a coordinate transformation that separates the observations well and makes it easy to describe the data, i.e. the different multi-dimensional observations, in a lower-dimensional space. Observations are multidimensional when they're made up of multiple measurements. If there are fewer linearly independent observations than there are measurements, we expect at least one of the eigenvalues to be zero, because e.g. two linearly independent observation vectors in a 3D space can be described by a 2D plane.

If we have an array

x = [    1     3     4
         2     4    -1
         4     6     9
         3     5    -2];

that consists of four observations with three measurements each, princomp(x) will find the lower-dimensional space spanned by the four observations. Since there are two co-dependent measurements, one of the eigenvalues will be near zero, because the space of measurements is only 2D and not 3D, which is probably the result you wanted to find. Indeed, if you inspect the eigenvectors (coeff), you find that the first two components are very obviously collinear:

coeff = princomp(x)
coeff =
      0.10124      0.69982      0.70711
      0.10124      0.69982     -0.70711
       0.9897     -0.14317   1.1102e-16

Since the first two components are, in fact, pointing in opposite directions, the values of the first two components of the transformed observations are, on their own, meaningless: [1 1 25] is equivalent to [1000 1000 25].

Now, if we want to find out whether any measurements are linearly dependent, and if we really want to use principal components for this, because in real life measurements may not be perfectly collinear and we are interested in finding good descriptor vectors for a machine-learning application, it makes a lot more sense to consider the three measurements as "observations" and run princomp(x'). Since there are thus only three "observations" but four "measurements", the fourth eigenvalue will be zero. However, since there are two linearly dependent observations, we're left with only two non-zero eigenvalues:

eigenvalues =
       24.263
       3.7368
            0
            0
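
For reference, a minimal sketch of the call that produces eigenvalues like the ones above (the output names are only illustrative):

x = [1 3 4; 2 4 -1; 4 6 9; 3 5 -2];
[coeffT, scoreT, eigenvalues] = princomp(x');  % the three measurements become the "observations"
disp(eigenvalues')   % two non-zero entries: the rows of x' span only a 2D subspace after centering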

To find out which of the measurements are so highly correlated (not actually necessary if you use the eigenvector-transformed measurements as input for e.g. machine learning), the best way would be to look at the correlation between the measurements:

corr(x)
  ans =
        1            1      0.35675
        1            1      0.35675
  0.35675      0.35675            1

Unsurprisingly, each measurement is perfectly correlated with itself, and v1 is perfectly correlated with v2.
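
If you want to flag the redundant measurements programmatically, a small sketch along these lines could work (the 0.99 threshold and the variable names are only illustrative):

x = [1 3 4; 2 4 -1; 4 6 9; 3 5 -2];        % the question's data
R = corr(x);                               % pairwise correlations between measurements
[i, j] = find(triu(abs(R) > 0.99, 1));     % strictly upper triangle avoids self- and duplicate pairs
redundantPairs = [i j]                     % for this x: [1 2], i.e. v2 adds nothing beyond v1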

EDIT2

but the eigenvalues tell us which vectors in the new space are most important (cover most of the variation), and the coefficients also tell us how much of each variable is in each component. So I assume we can use this data to find out which of the original variables hold most of the variance and are thus most important (and get rid of those that represent only a small amount)

This works if your observations show very little variance in one measurement variable (e.g. where x = [1 2 3;1 4 22;1 25 -25;1 11 100];, and thus the first variable contributes nothing to the variance). However, with collinear measurements, both vectors hold equivalent information and contribute equally to the variance. Thus, the eigenvectors (coefficients) are likely to be similar to one another.
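
To illustrate the first case, here is a quick sketch with the example matrix mentioned above; a constant measurement has zero variance and therefore contributes nothing:

x = [1  2   3;
     1  4  22;
     1 25 -25;
     1 11 100];
var(x)   % roughly [0  108.3  2872.7]: the first measurement never varies, so it can be dropped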

In order for @agnieszka's comments to keep making sense, I have left the original points 1-4 of my answer below. Note that #3 was in response to the division of the eigenvectors by the eigenvalues, which to me didn't make a lot of sense.

  1. The vectors should be in rows, not columns (each vector is an observation).
  2. coeff returns the basis vectors of the principal components, and its order has little to do with the original input.
  3. To see the importance of the principal components, use eigenvalues/sum(eigenvalues).
  4. If you have two collinear vectors, you can't say that the first is important and the second isn't. How do you know that it shouldn't be the other way around? If you want to test for collinearity, you should instead check the rank of the array, or call unique on normalized (i.e. norm equal to 1) vectors; see the sketch below.
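
A minimal sketch of the checks suggested in point 4, using the question's data (the data is centered first because v2 = v1 + 2 is an offset rather than a pure scaling, and the rounding tolerance is only illustrative):

x  = [1 3 4; 2 4 -1; 4 6 9; 3 5 -2];
xc = bsxfun(@minus, x, mean(x));              % center each measurement (column)
rank(xc)                                      % returns 2: only two of the three measurements are independent

xn = bsxfun(@rdivide, xc, sqrt(sum(xc.^2)));  % scale each centered column to unit norm
unique(round(xn' * 1e6) / 1e6, 'rows')        % only 2 distinct rows: v1 and v2 point in the same direction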
