Why Sklearn TruncatedSVD's explained variance ratios are not in descending order?


Question

Why are Sklearn.decomposition.TruncatedSVD's explained variance ratios not ordered by singular value?

My code is as follows:

import numpy as np
from sklearn.decomposition import TruncatedSVD

X = np.array([[1,1,1,1,0,0,0,0,0,0,0,0,0,0],
              [0,0,1,1,1,1,1,1,1,0,0,0,0,0],
              [0,0,0,0,0,0,1,1,1,1,1,1,0,0],
              [0,0,0,0,0,0,0,0,0,0,1,1,1,1]])
svd = TruncatedSVD(n_components=4)
svd.fit(X)
print(svd.explained_variance_ratio_)
print(svd.singular_values_)

And the results:

[0.17693405 0.46600983 0.21738089 0.13967523]
[3.1918354  2.39740372 1.83127499 1.30808033]

I heard that a singular value indicates how much of the data a component can explain, so I expected the explained variance ratios to follow the order of the singular values. But the ratios are not in descending order.

Can someone explain why this happens?

Answer

I heard that a singular value means how much the component can explain data

This holds for PCA, but it is not exactly true for (truncated) SVD; quoting from a relevant GitHub thread back in the day when an explained_variance_ratio_ attribute was not even available for TruncatedSVD (2014 - emphasis mine):

preserving the variance is not the exact objective function of truncated SVD without centering

So, the singular values themselves are indeed sorted in descending order, but this does not necessarily hold for the corresponding explained variance ratios if the data are not centered.

But if we do center the data first, then the explained variance ratios come out sorted in descending order, in correspondence with the singular values themselves:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD

sc = StandardScaler()
Xs = sc.fit_transform(X) # X data from the question here

svd = TruncatedSVD(n_components=4)
svd.fit(Xs)

print(svd.explained_variance_ratio_)
print(svd.singular_values_)

Results:

[4.60479851e-01 3.77856541e-01 1.61663608e-01 8.13905807e-66]
[5.07807756e+00 4.59999633e+00 3.00884730e+00 8.21430014e-17]
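To see where these ratios come from in the uncentered case, here is a small sketch (an assumption based on reading the scikit-learn source, not part of the original answer): explained_variance_ratio_ is the per-component variance of the projected data divided by the total variance of the raw X. Because the singular values of uncentered data measure projected second moments rather than variances, the ratios need not follow the singular-value order.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

X = np.array([[1,1,1,1,0,0,0,0,0,0,0,0,0,0],
              [0,0,1,1,1,1,1,1,1,0,0,0,0,0],
              [0,0,0,0,0,0,1,1,1,1,1,1,0,0],
              [0,0,0,0,0,0,0,0,0,0,1,1,1,1]], dtype=float)

svd = TruncatedSVD(n_components=4)
Xt = svd.fit_transform(X)  # projected data, U * Sigma

# Per-component variance of the projection, divided by the total
# variance of the (uncentered) input -- this is what sklearn reports.
manual_ratio = np.var(Xt, axis=0) / np.var(X, axis=0).sum()

print(manual_ratio)
print(np.allclose(manual_ratio, svd.explained_variance_ratio_))
```

Note that the ratios reproduce the unsorted order from the question, while svd.singular_values_ stay descending.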

For the mathematical & computational differences between centered and non-centered data in PCA & SVD calculations, see How does centering make a difference in PCA (for SVD and eigen decomposition)?

Regarding the use of TruncatedSVD itself, here is user ogrisel (a scikit-learn contributor) again, in a relevant answer to Difference between scikit-learn implementations of PCA and TruncatedSVD:

In practice TruncatedSVD is useful on large sparse datasets which cannot be centered without making the memory usage explode.

So, it's not exactly clear why you have chosen TruncatedSVD here, but if your dataset is not so large that it causes memory issues, I guess you should revert to PCA instead.
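To illustrate the suggestion above, here is a minimal sketch of the PCA alternative. PCA centers the data internally, so its explained variance ratios are always in descending order; the choice of n_components=3 is an assumption, since 4 centered samples span at most 3 dimensions.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1,1,1,1,0,0,0,0,0,0,0,0,0,0],
              [0,0,1,1,1,1,1,1,1,0,0,0,0,0],
              [0,0,0,0,0,0,1,1,1,1,1,1,0,0],
              [0,0,0,0,0,0,0,0,0,0,1,1,1,1]], dtype=float)

# PCA subtracts the column means itself, so no manual centering is needed.
pca = PCA(n_components=3)
pca.fit(X)

ratios = pca.explained_variance_ratio_
print(ratios)
# The ratios are guaranteed to be non-increasing.
print(all(ratios[i] >= ratios[i + 1] for i in range(len(ratios) - 1)))
```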
