Why Sklearn TruncatedSVD's explained variance ratios are not in descending order?
Question
Why are Sklearn.decomposition.TruncatedSVD's explained variance ratios not ordered by singular values?
My code is as follows:
import numpy as np
from sklearn.decomposition import TruncatedSVD

X = np.array([[1,1,1,1,0,0,0,0,0,0,0,0,0,0],
              [0,0,1,1,1,1,1,1,1,0,0,0,0,0],
              [0,0,0,0,0,0,1,1,1,1,1,1,0,0],
              [0,0,0,0,0,0,0,0,0,0,1,1,1,1]])
svd = TruncatedSVD(n_components=4)
svd.fit(X)
print(svd.explained_variance_ratio_)
print(svd.singular_values_)
And the result:
[0.17693405 0.46600983 0.21738089 0.13967523]
[3.1918354 2.39740372 1.83127499 1.30808033]
I heard that a singular value indicates how much of the data a component can explain, so I expected the explained variance ratios to follow the order of the singular values. But the ratios are not in descending order.
Can someone explain why this happens?
Answer
I heard that a singular value means how much the component can explain data
This holds for PCA, but it is not exactly true for (truncated) SVD; quoting from a relevant Github thread, back in the day when an explained_variance_ratio_ attribute was not even available for TruncatedSVD (2014 - emphasis mine):
preserving the variance is not the exact objective function of truncated SVD without centering
So, the singular values themselves are indeed sorted in descending order, but this does not necessarily hold for the corresponding explained variance ratios if the data are not centered.
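This can be checked numerically: TruncatedSVD reports each component's explained variance as the variance of the corresponding transformed column, relative to the total per-feature variance of the input. A minimal sketch recomputing the ratios by hand (this mirrors scikit-learn's documented behavior, though the exact internal formula is not part of its public API contract; random_state is added here for reproducibility):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

X = np.array([[1,1,1,1,0,0,0,0,0,0,0,0,0,0],
              [0,0,1,1,1,1,1,1,1,0,0,0,0,0],
              [0,0,0,0,0,0,1,1,1,1,1,1,0,0],
              [0,0,0,0,0,0,0,0,0,0,1,1,1,1]], dtype=float)

svd = TruncatedSVD(n_components=4, random_state=42)
X_t = svd.fit_transform(X)

# explained_variance_ is the variance of each transformed column, and the
# ratio divides by the total per-feature variance of X.  With uncentered X
# the transformed columns have non-zero means, so these variances are not
# proportional to the squared singular values, and the ordering can differ.
manual_ratio = X_t.var(axis=0) / X.var(axis=0).sum()

print(svd.explained_variance_ratio_)  # not in descending order
print(svd.singular_values_)           # in descending order
```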
But if we do center the data first, then the explained variance ratios do come out sorted in descending order, in correspondence with the singular values themselves:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
sc = StandardScaler()
Xs = sc.fit_transform(X) # X data from the question here
svd = TruncatedSVD(n_components=4)
svd.fit(Xs)
print(svd.explained_variance_ratio_)
print(svd.singular_values_)
Result:
[4.60479851e-01 3.77856541e-01 1.61663608e-01 8.13905807e-66]
[5.07807756e+00 4.59999633e+00 3.00884730e+00 8.21430014e-17]
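Why does centering restore the ordering? After centering, each transformed column has zero mean, so its variance is s_i^2 / n, and the ratios reduce to s_i^2 / sum_j(s_j^2), which inherits the descending order of the singular values. A quick sketch of that identity (it assumes the retained components span the full rank of the centered matrix, as they do here):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD

X = np.array([[1,1,1,1,0,0,0,0,0,0,0,0,0,0],
              [0,0,1,1,1,1,1,1,1,0,0,0,0,0],
              [0,0,0,0,0,0,1,1,1,1,1,1,0,0],
              [0,0,0,0,0,0,0,0,0,0,1,1,1,1]], dtype=float)

Xs = StandardScaler().fit_transform(X)   # zero-mean (and unit-variance) columns
svd = TruncatedSVD(n_components=4, random_state=42)
svd.fit(Xs)

# With centered input the transformed columns are zero-mean, so the explained
# variance ratios collapse to s_i^2 / sum_j(s_j^2) -- and squared singular
# values are sorted in descending order by construction.
s2 = svd.singular_values_ ** 2
print(svd.explained_variance_ratio_)
print(s2 / s2.sum())
```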
For the mathematical & computational differences between centered and non-centered data in PCA & SVD calculations, see How does centering make a difference in PCA (for SVD and eigen decomposition)?
Regarding the use of TruncatedSVD itself, here is user ogrisel (scikit-learn contributor) again, in a relevant answer to Difference between scikit-learn implementations of PCA and TruncatedSVD:
In practice TruncatedSVD is useful on large sparse datasets which cannot be centered without making the memory usage explode.
So it's not exactly clear why you chose TruncatedSVD here; unless your dataset is so large that centering it would cause memory issues, you should probably use PCA instead.