PCA projection and reconstruction in scikit-learn


Question

I can perform PCA in scikit-learn with the code below; X_train has 279180 rows and 104 columns.

from sklearn.decomposition import PCA
pca = PCA(n_components=30)
X_train_pca = pca.fit_transform(X_train)

Now, when I want to project the eigenvectors onto feature space, I must do following:

""" Projection """
comp = pca.components_ #30x104
com_tr = np.transpose(pca.components_) #104x30
proj = np.dot(X_train,com_tr) #279180x104 * 104x30 = 279180x30
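
Note that pca.transform subtracts the training mean before multiplying by components_.T, so the uncentered product above differs from X_train_pca by a constant offset. A minimal sketch of the centered projection, with small random data standing in for the real X_train:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(200, 10)  # small stand-in for the real 279180x104 X_train

pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)

# pca.transform centers the data first, so the uncentered product
# X.dot(components_.T) would differ from X_pca by the projected mean.
manual = (X - pca.mean_).dot(pca.components_.T)
assert np.allclose(X_pca, manual)
```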

But I am hesitating with this step, because Scikit documentation says:

components_: array, [n_components, n_features]

Principal axes in feature space, representing the directions of maximum variance in the data.

It seems to me that it is already projected, but when I checked the source code, it returns only the eigenvectors.

What is the correct way to do the projection?

Ultimately, I am aiming to calculate the MSE of reconstruction.

""" Reconstruct """
recon = np.dot(proj,comp) #279180x30 * 30x104 = 279180x104

"""  MSE Error """
print("MSE = %.6G" % np.mean((X_train - recon)**2))

Answer

You can do

proj = pca.inverse_transform(X_train_pca)

That way you do not have to worry about how to do the multiplications.

What you obtain after pca.fit_transform or pca.transform are what is usually called the "loadings" for each sample, meaning how much of each component you need to describe it best using a linear combination of the components_ (the principal axes in feature space).

The projection you are aiming at is back in the original signal space. This means that you need to go back into signal space using the components and the loadings.

So there are three steps to disambiguate here. Here you have, step by step, what you can do using the PCA object and how it is actually calculated:

  1. pca.fit estimates the components (using an SVD of the centered X_train):

from sklearn.decomposition import PCA
import numpy as np
from numpy.testing import assert_array_almost_equal

X_train = np.random.randn(100, 50)

pca = PCA(n_components=30)
pca.fit(X_train)

U, S, VT = np.linalg.svd(X_train - X_train.mean(0))

# Note: depending on the scikit-learn version, components_ may differ from
# VT[:30] by per-component sign flips (sklearn fixes signs via svd_flip).
assert_array_almost_equal(VT[:30], pca.components_)

  2. pca.transform calculates the loadings, as you describe:

    X_train_pca = pca.transform(X_train)
    
    X_train_pca2 = (X_train - pca.mean_).dot(pca.components_.T)
    
    assert_array_almost_equal(X_train_pca, X_train_pca2)
    

  3. pca.inverse_transform maps the loadings back into the original signal space, which is the projection you are after:

    X_projected = pca.inverse_transform(X_train_pca)
    X_projected2 = X_train_pca.dot(pca.components_) + pca.mean_
    
    assert_array_almost_equal(X_projected, X_projected2)
    

  4. You can now evaluate the projection loss:

    loss = ((X_train - X_projected) ** 2).mean()
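
Putting the four steps together, the reconstruction MSE can be computed end to end; a self-contained sketch with random data standing in for the question's X_train (sizes shrunk for speed):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_train = rng.randn(500, 104)  # stand-in for the question's 279180x104 matrix

pca = PCA(n_components=30)
X_train_pca = pca.fit_transform(X_train)    # loadings, shape (500, 30)

# Map the loadings back into the original 104-dimensional signal space.
recon = pca.inverse_transform(X_train_pca)  # shape (500, 104)

# Reconstruction MSE: the mean squared residual in the discarded directions.
mse = np.mean((X_train - recon) ** 2)
print("MSE = %.6G" % mse)
```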
    
