What's wrong with my PCA?


Problem Description

My code:

from numpy import *

def pca(orig_data):
    data = array(orig_data)
    data = (data - data.mean(axis=0)) / data.std(axis=0)
    u, s, v = linalg.svd(data)
    print(s)  # should be s**2 instead!
    print(v)

def load_iris(path):
    lines = []
    with open(path) as input_file:
        lines = input_file.readlines()
    data = []
    for line in lines:
        cur_line = line.rstrip().split(',')
        cur_line = cur_line[:-1]
        cur_line = [float(elem) for elem in cur_line]
        data.append(array(cur_line))
    return array(data)

if __name__ == '__main__':
    data = load_iris('iris.data')
    pca(data)

Iris dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

Output:

[ 20.89551896  11.75513248   4.7013819    1.75816839]
[[ 0.52237162 -0.26335492  0.58125401  0.56561105]
 [-0.37231836 -0.92555649 -0.02109478 -0.06541577]
 [ 0.72101681 -0.24203288 -0.14089226 -0.6338014 ]
 [ 0.26199559 -0.12413481 -0.80115427  0.52354627]]

Desired output:

Eigenvalues - [2.9108 0.9212 0.1474 0.0206]
Principal components - the same as I got, but transposed, so that's okay I guess.

Also, what's with the output of the linalg.eig function? According to the description of PCA on Wikipedia, I'm supposed to do this:

cov_mat = cov(orig_data)
val, vec = linalg.eig(cov_mat)
print(val)

But it doesn't really match the output in the tutorials I found online. Plus, if I have 4 dimensions, I thought I should have 4 eigenvalues, not 150 like eig gives me. Am I doing something wrong?

Edit: I've noticed that the values differ by a factor of 150, which is the number of elements in the dataset (for example, 20.8955² ≈ 436.6, and 436.6 / 150 ≈ 2.911, the first desired eigenvalue). Also, the eigenvalues are supposed to sum to the number of dimensions, in this case 4. What I don't understand is why this difference happens. If I simply divide the eigenvalues by len(data) I get the result I want, but I don't understand why. Either way, the proportions of the eigenvalues aren't altered, but they matter to me, so I'd like to understand what's going on.

Recommended Answer

You are decomposing the wrong matrix.

Principal component analysis requires manipulating the eigenvectors/eigenvalues of the covariance matrix, not the data itself. The covariance matrix built from an m x n data matrix is a square matrix with one row and one column per feature; when the data are standardized it is the correlation matrix, which has ones along the main diagonal.

You can indeed use the cov function, but you need to do some further manipulation of your data first. It's probably a little easier to use a similar function, corrcoef:

import numpy as NP
import numpy.linalg as LA

# a simulated data set with 8 data points, each point having five features
# (cast to float so the in-place mean-centering below doesn't fail on an int array)
data = NP.random.randint(0, 10, 40).reshape(8, 5).astype(float)

# usually a good idea to mean-center your data first:
data -= NP.mean(data, axis=0)

# calculate the correlation matrix (corrcoef standardizes the variables,
# i.e., it is the covariance matrix of the standardized data)
C = NP.corrcoef(data, rowvar=0)
# returns a square matrix with one row/column per feature -- here a 5 x 5 matrix

# now get the eigenvalues/eigenvectors of C:
eigenvalues, eigenvectors = LA.eig(C)
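
A rough sketch of the remaining steps, sorting the eigenpairs and then projecting the standardized data onto the top k components, might look like the following (the choice of k is arbitrary, and the setup from above is repeated so the snippet runs on its own):

import numpy as NP
import numpy.linalg as LA

# repeat the setup above so this sketch is self-contained
data = NP.random.randint(0, 10, 40).reshape(8, 5).astype(float)
data -= NP.mean(data, axis=0)
C = NP.corrcoef(data, rowvar=0)
eigenvalues, eigenvectors = LA.eig(C)

# sort the eigenpairs by eigenvalue, largest first
order = NP.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# the correlation matrix corresponds to standardized variables,
# so standardize (data is already mean-centered) before projecting
Z = data / NP.std(data, axis=0)

# keep the first k components and project the data onto them
k = 2
scores = NP.dot(Z, eigenvectors[:, :k])  # 8 x k matrix of component scores
print(scores.shape)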

To get the eigenvectors/eigenvalues, I did not decompose the covariance matrix using SVD, though you certainly can. My preference is to calculate them using eig in NumPy's (or SciPy's) LA module--it is a little easier to work with than svd, and the return values are the eigenvectors and eigenvalues themselves, nothing else. By contrast, as you know, svd doesn't return these directly.

Granted, the SVD function will decompose any matrix, not just square ones (to which the eig function is limited); however, when doing PCA you will always have a square matrix to decompose, regardless of the form your data is in. This is because the matrix you decompose in PCA is a covariance (or correlation) matrix, which by definition is always square: its rows and columns are both indexed by the features of the original data, each cell holds the covariance of one pair of features, and the ones down the main diagonal of the correlation matrix reflect that a feature is perfectly correlated with itself.
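
Regarding the factor of 150 noted in the question: for a standardized n x m data matrix X (as produced in the question's pca function), the correlation matrix equals X^T X / n, so its eigenvalues are the squared singular values of X divided by n, the number of samples. A minimal sketch of that relationship, using random stand-in data of the same shape as iris:

import numpy as NP
import numpy.linalg as LA

# random stand-in data with the same shape as the iris matrix (150 samples, 4 features)
X = NP.random.randn(150, 4)
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize, as in the question's pca()

# route 1: eigenvalues of the correlation matrix
eigenvalues, _ = LA.eig(NP.corrcoef(X, rowvar=0))

# route 2: singular values of the standardized data itself
_, s, _ = LA.svd(X, full_matrices=False)

# the squared singular values, divided by the number of samples,
# match the correlation-matrix eigenvalues (up to ordering)
print(NP.allclose(NP.sort(eigenvalues.real), NP.sort(s**2 / len(X))))  # True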
