scikit-learn PCA: matrix transformation produces PC estimates with flipped signs


Question

I'm using scikit-learn to perform PCA on this dataset. The scikit-learn documentation states that

Due to implementation subtleties of the Singular Value Decomposition (SVD), which is used in this implementation, running fit twice on the same matrix can lead to principal components with signs flipped (change in direction). For this reason, it is important to always use the same estimator object to transform data in a consistent fashion.

The problem is that I don't think I'm using different estimator objects, yet the signs of some of my PCs are flipped compared with the results from SAS's PROC PRINCOMP procedure.

For the first observation in the dataset, the SAS PCs are:

PC1      PC2      PC3       PC4      PC5
2.0508   1.9600   -0.1663   0.2965   -0.0121

From scikit-learn, I get the following (which are very close in magnitude):

PC1      PC2      PC3       PC4      PC5
-2.0536  -1.9627  -0.1666   -0.297   -0.0122

Here's what I'm doing:

import pandas as pd
import numpy  as np
from sklearn.decomposition import PCA

sourcef = pd.read_csv('C:/mydata.csv')
frame = pd.DataFrame(sourcef)

# Some pandas evals, regressions, etc... that I'm not showing
# but not affecting the matrix

# Make sure we are working with the proper data -- drop the response variable
cols = [col for col in frame.columns if col not in ['response']]

# Separate out the data matrix from the response variable vector 
# into numpy arrays
frame2_X = frame[cols].values
frame2_y = frame['response'].values

# Standardize the values
X_means = np.mean(frame2_X,axis=0)
X_stds  = np.std(frame2_X,axis=0)

y_mean = np.mean(frame2_y)
y_std  = np.std(frame2_y)

frame2_X_stdz = np.copy(frame2_X)
frame2_y_stdz = frame2_y.astype(np.float32, copy=True)

for (x,y), value in np.ndenumerate(frame2_X_stdz):
    frame2_X_stdz[x][y] = (value - X_means[y])/X_stds[y]

for index, value in enumerate(frame2_y_stdz):
    frame2_y_stdz[index] = (float(value) - y_mean)/y_std

# Show the first 5 elements of the standardized values, to verify
print(frame2_X_stdz[:,0][:5])

# Show the first 5 lines from the standardized response vector, to verify
print(frame2_y_stdz[:5])

This checks out:

[ 0.9508 -0.5847 -0.2797 -0.4039 -0.598 ]
[ 1.0726 -0.5009 -0.0942 -0.1187 -0.8043]
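The element-wise loops above can also be written with NumPy broadcasting. This is a minimal sketch, not the original poster's code: the helper name `standardize`, its `ddof` parameter, and the toy array `X_demo` are illustrative assumptions. Note that `np.std` defaults to `ddof=0` (population standard deviation); software that divides by n-1 would produce slightly different magnitudes, though whether that explains the small differences from the SAS values is not confirmed by the post.

```python
import numpy as np

def standardize(X, ddof=0):
    """Center each column and scale it to unit standard deviation."""
    # Converting to float avoids integer truncation if X holds integers
    X = np.asarray(X, dtype=np.float64)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=ddof)

# Hypothetical toy data standing in for the real dataset
X_demo = np.array([[1.0, 10.0], [2.0, 14.0], [3.0, 12.0], [4.0, 20.0]])
Z = standardize(X_demo)            # population std, like np.std's default
Z_sample = standardize(X_demo, 1)  # sample std (divides by n-1)
```

Both results have zero column means; they differ only by a constant scale factor per column.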

Continuing on...

# Create a PCA object
pca = PCA()
pca.fit(frame2_X_stdz)

# Create the matrix of PC estimates
pca.transform(frame2_X_stdz)

Here's the output from that last step:

Out[16]: array([[-2.0536, -1.9627, -0.1666, -0.297 , -0.0122],
       [ 1.382 , -0.382 , -0.5692, -0.0257, -0.0509],
       [ 0.4342,  0.611 ,  0.2701,  0.062 , -0.011 ],
       ..., 
       [ 0.0422,  0.7251, -0.1926,  0.0089,  0.0005],
       [ 1.4502, -0.7115, -0.0733,  0.0013, -0.0557],
       [ 0.258 ,  0.3684,  0.1873,  0.0403,  0.0042]])

I've tried replacing the pca.fit() and pca.transform() calls with pca.fit_transform(), but I end up with the same results.

What am I doing wrong here that I'm getting PCs with the signs flipped?

Recommended answer

You're doing nothing wrong.

What the documentation is warning you about is that repeated calls to fit may yield different principal components, not how they relate to another PCA implementation.

Having a flipped sign on some or all components doesn't make the result wrong. The result is right as long as it fulfills the definition (each component is chosen such that it captures the maximum amount of variance in the data). As it stands, the projection you got is simply mirrored along the affected axes: it still fulfills the definition and is, thus, correct.
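To see why a sign flip leaves the result valid, here is a minimal sketch using plain NumPy SVD rather than scikit-learn itself. The random data, the seed, and the chosen flip pattern (PC1, PC2 and PC4, mirroring the question) are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical correlated data, centered column-wise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)

# PCA via SVD: the rows of Vt are the principal axes
U, S, Vt = np.linalg.svd(X, full_matrices=False)
scores = X @ Vt.T

# Flip the sign of PC1, PC2 and PC4
flip = np.array([-1.0, -1.0, 1.0, -1.0, 1.0])
Vt_flipped = Vt * flip[:, np.newaxis]
scores_flipped = X @ Vt_flipped.T

# Variance captured per component is unchanged
same_variance = np.allclose(scores.var(axis=0), scores_flipped.var(axis=0))
# The flipped axes are still orthonormal
still_orthonormal = np.allclose(Vt_flipped @ Vt_flipped.T, np.eye(5))
# Both sets of scores reconstruct the same data
same_reconstruction = np.allclose(scores @ Vt, scores_flipped @ Vt_flipped)
print(same_variance, still_orthonormal, same_reconstruction)
```

All three checks should hold, since negating an axis changes neither the variance along it nor its orthogonality to the other axes.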

If, beyond correctness, you're worried about consistency between implementations, you can simply multiply the affected components by -1 where necessary.
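One way to decide which components to multiply by -1 is to compare each score column with the corresponding column from the reference implementation. The sketch below is an assumption about how that could be done, not part of either library: `align_signs` is a hypothetical helper, and it uses only the first observation quoted in the question.

```python
import numpy as np

def align_signs(scores, reference_scores):
    """Flip columns of `scores` so each points the same way as the reference."""
    scores = np.asarray(scores, dtype=np.float64)
    reference_scores = np.asarray(reference_scores, dtype=np.float64)
    # A negative column-wise dot product means the column is mirrored
    signs = np.sign(np.sum(scores * reference_scores, axis=0))
    signs[signs == 0] = 1.0  # leave a column unchanged if undecided
    return scores * signs, signs

# First observation from the question: SAS vs. scikit-learn
sas = np.array([[2.0508, 1.9600, -0.1663, 0.2965, -0.0121]])
skl = np.array([[-2.0536, -1.9627, -0.1666, -0.2970, -0.0122]])
aligned, signs = align_signs(skl, sas)
print(signs)
```

In practice it would be safer to pass all observations rather than a single row, and the same `signs` vector would also need to multiply the rows of `pca.components_` so that the axes and the scores stay consistent.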
