Obtain eigenvalues and eigenvectors from sklearn PCA


Problem description

How can I get the eigenvalues and eigenvectors of the PCA application?

from sklearn.decomposition import PCA
clf=PCA(0.98,whiten=True)      # conserve 98% variance
X_train=clf.fit_transform(X_train)
X_test=clf.transform(X_test)

I cannot find them in the docs.

1. I am not able to comprehend the different results here.

Edit:

def pca_code(data):
    #raw_implementation
    var_per=.98
    data-=np.mean(data, axis=0)
    data/=np.std(data, axis=0)
    cov_mat=np.cov(data, rowvar=False)
    evals, evecs = np.linalg.eigh(cov_mat)
    idx = np.argsort(evals)[::-1]
    evecs = evecs[:,idx]
    evals = evals[idx]
    variance_retained=np.cumsum(evals)/np.sum(evals)
    index=np.argmax(variance_retained>=var_per)
    evecs = evecs[:,:index+1]
    reduced_data=np.dot(evecs.T, data.T).T
    print(evals)
    print("_"*30)
    print(evecs)
    print("_"*30)
    # using sklearn's PCA
    clf=PCA(var_per)
    X_train=data.T
    X_train=clf.fit_transform(X_train)
    print(clf.explained_variance_)
    print("_"*30)
    print(clf.components_)
    print("__"*30)

  1. I want to get all the eigenvalues and eigenvectors instead of just the reduced set that satisfies the convergence condition.

Answer

Your implementation

You are computing the eigenvectors of the correlation matrix, that is, the covariance matrix of the normalized variables.
data/=np.std(data, axis=0) is not part of classic PCA; we only center the variables. So the sklearn PCA does not feature-scale the data beforehand.
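
To see the distinction concretely, here is a minimal sketch (synthetic data, not part of the original answer): standardizing the columns and then taking the covariance yields exactly the correlation matrix of the raw data.

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 4))

# Covariance of the standardized variables (population std, ddof=0)
# is the correlation matrix of the original variables.
standardized = (data - data.mean(axis=0)) / data.std(axis=0)
print(np.allclose(np.cov(standardized, rowvar=False, ddof=0),
                  np.corrcoef(data, rowvar=False)))   # expected: True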

Apart from that, you are on the right track, if we abstract away the fact that the code you provided did not run ;). You only got confused with the row/column layouts. Honestly, I think it is much easier to start with X = data.T and work only with X from there on. I added your code 'fixed' at the end of the post.

You already noted that you can get the eigenvectors using clf.components_.

So you have the principal components. They are eigenvectors of the covariance matrix $X^T X$.

A way to retrieve the eigenvalues from there is to apply this matrix to each principal component and project the result onto the component. Let $v_1$ be the first principal component and $\lambda_1$ the associated eigenvalue. We have:

$(X^T X) v_1 = \lambda_1 v_1$

and thus:

$\lambda_1 = (v_1, (X^T X) v_1)$

since $(v_1, v_1) = 1$, where $(x, y)$ denotes the scalar product of the vectors $x$ and $y$.

Back in Python you can do:

n_samples = X.shape[0]
# We center the data and compute the sample covariance matrix.
X -= np.mean(X, axis=0)
cov_matrix = np.dot(X.T, X) / n_samples
# pca is an already fitted PCA instance
for eigenvector in pca.components_:
    print(np.dot(eigenvector.T, np.dot(cov_matrix, eigenvector)))

And you get the eigenvalue associated with the eigenvector. Well, in my tests it turned out not to work for the last couple of eigenvalues, but I would attribute that to my lack of skills in numerical stability.

Now that is not the best way to get the eigenvalues, but it is nice to know where they come from.
The eigenvalues represent the variance in the direction of the eigenvector, so you can get them through the pca.explained_variance_ attribute:

eigenvalues = pca.explained_variance_
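
As a sanity check (a minimal sketch, not part of the original answer, assuming the default SVD-based solver), explained_variance_ is the squared singular values divided by n_samples - 1, i.e. the eigenvalues of the sample covariance matrix with a 1/(n-1) denominator; this is also one reason values computed from a covariance matrix with a 1/n denominator are close but not identical.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, random_state=0)
pca = PCA().fit(X)

# explained_variance_ equals singular_values_**2 / (n_samples - 1).
print(np.allclose(pca.singular_values_ ** 2 / (X.shape[0] - 1),
                  pca.explained_variance_))   # expected: True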

Here is a reproducible example that prints the eigenvalues you get with each method:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import make_classification


X, y = make_classification(n_samples=1000)
n_samples = X.shape[0]

pca = PCA()
X_transformed = pca.fit_transform(X)

# We center the data and compute the sample covariance matrix.
X_centered = X - np.mean(X, axis=0)
cov_matrix = np.dot(X_centered.T, X_centered) / n_samples
eigenvalues = pca.explained_variance_
for eigenvalue, eigenvector in zip(eigenvalues, pca.components_):    
    print(np.dot(eigenvector.T, np.dot(cov_matrix, eigenvector)))
    print(eigenvalue)

Your original code, fixed

If you run it, you will see the values are consistent. They are not exactly equal because numpy and scikit-learn do not use the same algorithm here.
The main thing was that you were using the correlation matrix instead of the covariance matrix, as mentioned above. You were also getting the transposed eigenvectors from numpy, which made it very confusing.

import numpy as np
from scipy.stats.mstats import zscore
from sklearn.decomposition import PCA

def pca_code(data):
    #raw_implementation
    var_per=.98
    data-=np.mean(data, axis=0)
    # data/=np.std(data, axis=0)
    cov_mat=np.cov(data, rowvar=False)
    evals, evecs = np.linalg.eigh(cov_mat)
    idx = np.argsort(evals)[::-1]
    evecs = evecs[:,idx]
    evals = evals[idx]
    variance_retained=np.cumsum(evals)/np.sum(evals)
    index=np.argmax(variance_retained>=var_per)
    evecs = evecs[:,:index+1]
    reduced_data=np.dot(evecs.T, data.T).T
    print("evals", evals)
    print("_"*30)
    print(evecs.T[1, :])
    print("_"*30)
    # using sklearn's PCA
    clf=PCA(var_per)
    X_train=data
    X_train=clf.fit_transform(X_train)
    print(clf.explained_variance_)
    print("_"*30)
    print(clf.components_[1,:])
    print("__"*30)

Hope this helps, feel free to ask for clarifications.
