In sklearn.decomposition.PCA, why are components_ negative?


Question


I'm trying to follow along with Abdi & Williams - Principal Component Analysis (2010) and build principal components through SVD, using numpy.linalg.svd.

When I display the components_ attribute from a fitted PCA with sklearn, they're of the exact same magnitude as the ones that I've manually computed, but some (not all) are of opposite sign. What's causing this?

Update: my (partial) answer below contains some additional info.

Take the following example data:

from pandas_datareader.data import DataReader as dr
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# sample data - shape (20, 3), each column standardized to mean 0, unit variance
rates = scale(dr(['DGS5', 'DGS10', 'DGS30'], 'fred', 
           start='2017-01-01', end='2017-02-01').pct_change().dropna())

# with sklearn PCA:
pca = PCA().fit(rates)
print(pca.components_)
[[-0.58365629 -0.58614003 -0.56194768]
 [-0.43328092 -0.36048659  0.82602486]
 [-0.68674084  0.72559581 -0.04356302]]

# compare to the manual method via SVD:
u, s, Vh = np.linalg.svd(rates, full_matrices=False)
print(Vh)
[[ 0.58365629  0.58614003  0.56194768]
 [ 0.43328092  0.36048659 -0.82602486]
 [-0.68674084  0.72559581 -0.04356302]]

# odd: some, but not all signs reversed
print(np.isclose(Vh, -1 * pca.components_))
[[ True  True  True]
 [ True  True  True]
 [False False False]]

Solution

As you figured out in your answer, the results of a singular value decomposition (SVD) are not unique in terms of singular vectors. Indeed, if the SVD of X is $X = \sum_{i=1}^{r} s_i u_i v_i^\top$:

with the s_i ordered in decreasing fashion, then you can see that you can change the sign (i.e., "flip") of, say, u_1 and v_1; the minus signs will cancel, so the formula still holds.

This shows that the SVD is unique only up to a simultaneous sign change in each pair of left and right singular vectors.
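
This invariance is easy to check numerically. Here is a minimal sketch (the random matrix stands in for X and is my own illustration, not data from the question):

import numpy as np

# any matrix will do for this check
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))

U, s, Vh = np.linalg.svd(X, full_matrices=False)

# flip the sign of the first left/right singular vector pair
U2, Vh2 = U.copy(), Vh.copy()
U2[:, 0] *= -1
Vh2[0, :] *= -1

# both factorizations reconstruct X: the minus signs cancel
print(np.allclose(U @ np.diag(s) @ Vh, X))    # True
print(np.allclose(U2 @ np.diag(s) @ Vh2, X))  # True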

Since PCA is just an SVD of the centered X (or, equivalently, an eigenvalue decomposition of $X^\top X$), there is no guarantee that it returns the same signs on the same X every time it is performed. Understandably, the scikit-learn implementation wants to avoid this: it guarantees that the left and right singular vectors returned (stored in U and V) are always the same, by imposing the (arbitrary) convention that the coefficient of u_i with the largest absolute value be positive.
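
The equivalence (and the sign ambiguity) can be seen concretely. The sketch below compares the right singular vectors of X with the eigenvectors of X^T X on synthetic data; the data and variable names are mine:

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 3))

# right singular vectors of X are the rows of Vh
_, _, Vh = np.linalg.svd(X, full_matrices=False)

# eigenvectors of X^T X (columns of W), sorted by decreasing eigenvalue
eigvals, W = np.linalg.eigh(X.T @ X)
W = W[:, np.argsort(eigvals)[::-1]]

# each eigenvector matches a right singular vector only up to sign
for i in range(3):
    print(np.allclose(W[:, i], Vh[i]) or np.allclose(W[:, i], -Vh[i]))  # True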

As you can see by reading the source: first they compute U and V with linalg.svd(). Then, for each vector u_i (i.e., column of U), if its largest element in absolute value is positive, they don't do anything. Otherwise, they change u_i to -u_i and the corresponding right singular vector, v_i, to -v_i. As explained earlier, this does not change the SVD formula, since the minus signs cancel out. However, it is now guaranteed that the U and V returned after this processing are always the same, since the indeterminacy in the sign has been removed.
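
Applied to the example above, that convention can be reproduced in a few lines. This is a sketch modeled on scikit-learn's svd_flip helper, not the library code itself, and the exact convention may differ across scikit-learn versions:

import numpy as np

# u, Vh are the manual SVD results from the question
# for each column of u, take the sign of its largest-magnitude entry
max_abs_rows = np.argmax(np.abs(u), axis=0)
signs = np.sign(u[max_abs_rows, range(u.shape[1])])

# flip each u_i / v_i pair so that entry becomes positive
u_fixed = u * signs
Vh_fixed = Vh * signs[:, np.newaxis]

# under this convention, Vh_fixed should match pca.components_
print(np.allclose(Vh_fixed, pca.components_))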
