PCA projection and reconstruction in scikit-learn
Question
I can perform PCA in scikit-learn with the code below; X_train has 279180 rows and 104 columns.
from sklearn.decomposition import PCA
pca = PCA(n_components=30)
X_train_pca = pca.fit_transform(X_train)
Now, when I want to project the data onto the space spanned by the eigenvectors, I do the following:
""" Projection """
comp = pca.components_ #30x104
com_tr = np.transpose(pca.components_) #104x30
proj = np.dot(X_train,com_tr) #279180x104 * 104x30 = 297180x30
But I am hesitant about this step, because the scikit-learn documentation says:
components_: array, [n_components, n_features]
Principal axes in feature space, representing the directions of maximum variance in the data.
It seems to me that it is already projected, but when I checked the source code, components_ holds only the eigenvectors. What is the right way to project?
Ultimately, I am aiming to calculate the MSE of the reconstruction.
""" Reconstruct """
recon = np.dot(proj,comp) #297180x30 * 30x104 = 279180x104
""" MSE Error """
print "MSE = %.6G" %(np.mean((X_train - recon)**2))
Answer
You can do

proj = pca.inverse_transform(X_train_pca)

That way you do not have to worry about how to do the multiplications.
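As a quick end-to-end sketch of this round trip (using small synthetic data in place of the 279180 x 104 X_train from the question, so the shapes are illustrative only):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(200, 10)  # stand-in for X_train

pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)          # loadings, shape (200, 3)
X_rec = pca.inverse_transform(X_pca)  # back in signal space, shape (200, 10)

# Reconstruction MSE, as in the question
mse = np.mean((X - X_rec) ** 2)
print("MSE = %.6G" % mse)
```

Note that inverse_transform also adds pca.mean_ back, which the manual np.dot(proj, comp) reconstruction in the question omits.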
What you obtain after pca.fit_transform or pca.transform is what is usually called the "loadings" for each sample, meaning how much of each component you need to describe it best using a linear combination of the components_ (the principal axes in feature space).
The projection you are aiming at is back in the original signal space. This means that you need to go back into signal space using the components and the loadings.
So there are three steps to disambiguate here. Here you have, step by step, what you can do using the PCA object and how it is actually calculated:
1. pca.fit estimates the components (using an SVD on the centered X_train):
from sklearn.decomposition import PCA
import numpy as np
from numpy.testing import assert_array_almost_equal

X_train = np.random.randn(100, 50)

pca = PCA(n_components=30)
pca.fit(X_train)

U, S, VT = np.linalg.svd(X_train - X_train.mean(0))
assert_array_almost_equal(VT[:30], pca.components_)
2. pca.transform calculates the loadings as you describe:
X_train_pca = pca.transform(X_train)
X_train_pca2 = (X_train - pca.mean_).dot(pca.components_.T)
assert_array_almost_equal(X_train_pca, X_train_pca2)
3. pca.inverse_transform obtains the projection onto the components in signal space that you are interested in:
X_projected = pca.inverse_transform(X_train_pca)
X_projected2 = X_train_pca.dot(pca.components_) + pca.mean_
assert_array_almost_equal(X_projected, X_projected2)
You can now evaluate the projection loss:
loss = ((X_train - X_projected) ** 2).mean()
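As a side note (not part of the original answer), this loss can be cross-checked against the variance left in the discarded components: with a full PCA fit, the reconstruction error for k components equals the sum of explained_variance_ for components k+1 onward, scaled by (n-1)/(n*d) because explained_variance_ uses an n-1 denominator while the loss averages over all n*d entries. A small sketch, assuming synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
n, d, k = 100, 50, 30
X_train = rng.randn(n, d)

# Reconstruction loss with k components
pca = PCA(n_components=k)
X_projected = pca.inverse_transform(pca.fit_transform(X_train))
loss = ((X_train - X_projected) ** 2).mean()

# Full decomposition to see the variance in the discarded components
pca_full = PCA().fit(X_train)
discarded = pca_full.explained_variance_[k:].sum()

# explained_variance_ divides by (n - 1); loss averages over n * d entries
assert np.isclose(loss, discarded * (n - 1) / (n * d))
```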