Bug in Scikit-Learn PCA or in Numpy Eigen Decomposition?


Problem description

I have a dataset with 400 features.

What I did:

# approach 1
d_cov = np.cov(d_train.transpose())
eigens, mypca = LA.eig(d_cov)  # LA = np.linalg; assume also sorted by eigenvalue

# approach 2 
pca = PCA(n_components=300)
d_fit = pca.fit_transform(d_train)
pc = pca.components_

Now, these two should be the same, right? PCA is just the eigendecomposition of the covariance matrix.

But in my case they are very different. How can that be? Am I making a mistake somewhere above?
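(The premise itself is sound. A minimal sketch, using `np.linalg.eigh`, the routine for symmetric matrices that returns eigenvalues in ascending order, rather than `eig`, shows that the covariance eigenvalues do match sklearn's `explained_variance_`; the random data and seed here are illustrative assumptions:)

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
d_train = rng.standard_normal((100, 10))

# Eigendecomposition of the covariance matrix; eigh handles the
# symmetric case and returns eigenvalues in ascending order.
cov = np.cov(d_train.T)
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals = eigvals[::-1]  # descending, to match sklearn's ordering

pca = PCA(n_components=10).fit(d_train)

# The explained variances equal the sorted covariance eigenvalues.
print(np.allclose(eigvals, pca.explained_variance_))  # → True
```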

Comparing the variances:

import numpy as np
LA = np.linalg
d_train = np.random.randn(100, 10)
d_cov = np.cov(d_train.transpose())
eigens, mypca = LA.eig(d_cov)

import matplotlib.pyplot as plt


from sklearn.decomposition import PCA
pca = PCA(n_components=10)
d_fit = pca.fit_transform(d_train)
pc = pca.components_
ve = pca.explained_variance_
#mypca[0,:], pc[0,:] pc.transpose()[0,:]

plt.plot(list(range(len(eigens))), [ x.transpose().dot(d_cov).dot(x) for x,y  in zip(mypca, eigens) ])
plt.plot(list(range(len(ve))), ve)
plt.show()

print(mypca, '\n---\n' , pc)

Answer

You need to read the documentation more carefully. numpy's docs are great and very thorough; very often you'll find the solution to your problem just by reading them.

Here is a modified version of your code (imports at the top of the snippet, `.T` instead of `.transpose()`, PEP 8):

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

from numpy import linalg as LA

d_train = np.random.randn(100, 10)
d_cov = np.cov(d_train.transpose())
eigens, mypca = LA.eig(d_cov)

pca = PCA(n_components=10)
d_fit = pca.fit_transform(d_train)
pc = pca.components_
explained = pca.explained_variance_

my_explained = np.sort([x.T.dot(d_cov).dot(x) for x in mypca.T])[::-1]

plt.close('all')
plt.figure()
plt.plot(my_explained, label='me')
plt.plot(explained, label='sklearn')
plt.legend()
plt.show(block=False)

The two curves are exactly the same. The important thing is that I iterate over `mypca.T`, not `mypca`. From the `np.linalg.eig` docstring:

Signature: np.linalg.eig(a)
Docstring:
Compute the eigenvalues and right eigenvectors of a square array.

Parameters
----------
a : (..., M, M) array
    Matrices for which the eigenvalues and right eigenvectors will
    be computed

Returns
-------
w : (..., M) array
    # not important for you

v : (..., M, M) array
    The normalized (unit "length") eigenvectors, such that the
    column ``v[:,i]`` is the eigenvector corresponding to the
    eigenvalue ``w[i]``.

The eigenvectors are returned as the columns of `mypca`, not its rows; `for x in mypca` was iterating over rows.
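To see the difference concretely, here is a small sketch (random data with an assumed seed) showing that a column of `mypca` satisfies A·v = λ·v while a row generally does not:

```python
import numpy as np

rng = np.random.default_rng(0)
d_cov = np.cov(rng.standard_normal((100, 10)).T)
eigens, mypca = np.linalg.eig(d_cov)

# A column of mypca is an eigenvector: d_cov @ v equals eigens[i] * v.
v_col = mypca[:, 0]
print(np.allclose(d_cov @ v_col, eigens[0] * v_col))  # → True

# A row (which is what `for x in mypca` yields) is generally not.
v_row = mypca[0]
print(np.allclose(d_cov @ v_row, eigens[0] * v_row))  # → False
```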
