How can scikit-learn perform PCA on sparse data in libsvm format?
Question
I am using scikit-learn for a dimensionality reduction task. My training/test data is in the libsvm format. It is a large sparse matrix with half a million columns.
I load the data with the load_svmlight_file function, but when I pass it to SparsePCA, scikit-learn throws an exception about the input data.
How can I solve this?
Recommended answer
Sparse PCA is an algorithm for finding a sparse decomposition of dense data: the sparsity constraint applies to the components, not to the input.
If you want to do vanilla PCA on sparse data, you should use sklearn.decomposition.RandomizedPCA, which implements a scalable approximate method that works on both sparse and dense data.
IIRC sklearn.decomposition.PCA only works on dense data at the moment. Support for sparse data could be added in the future, for instance by delegating the SVD computation on the sparse data matrix to arpack.
Edit: as noted in the comments, sparse input for RandomizedPCA is deprecated. Instead, you should use sklearn.decomposition.TruncatedSVD, which does precisely what RandomizedPCA used to do on sparse data, and which should not have been called PCA in the first place.
To clarify: PCA is mathematically defined as centering the data (subtracting the mean of each feature) and then applying truncated SVD to the centered data.
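As a rough illustration of that definition, here is a minimal sketch on a small dense toy matrix. The data, the variable names and the choice of k = 3 are made up for the example. It centers the columns by hand, takes the SVD, keeps the top k components, and compares the result with sklearn.decomposition.PCA up to the sign of each component.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up dense toy data: 50 samples, 8 features.
rng = np.random.RandomState(0)
X = rng.rand(50, 8)
k = 3  # number of components to keep (arbitrary for this sketch)

# Step 1: center each feature (column) by subtracting its mean.
X_centered = X - X.mean(axis=0)

# Step 2: SVD of the centered data, truncated to the top-k components.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
manual_scores = U[:, :k] * S[:k]

# Reference: scikit-learn's PCA with the exact (full SVD) solver.
pca = PCA(n_components=k, svd_solver="full")
sk_scores = pca.fit_transform(X)

# Components are only defined up to sign, so compare absolute values.
same_projection = np.allclose(np.abs(manual_scores), np.abs(sk_scores))
same_singular_values = np.allclose(S[:k], pca.singular_values_)
print(same_projection, same_singular_values)
```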
Centering the data destroys the sparsity and forces a dense representation that often no longer fits in memory, so it is common to apply truncated SVD directly to the sparse data, without centering. This resembles PCA but is not exactly the same. scikit-learn implements it as sklearn.decomposition.TruncatedSVD.
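For the setup described in the question, this might look roughly like the sketch below. To keep it self-contained it first writes a small random sparse matrix to a temporary libsvm file. The file name demo.svm, the matrix size and n_components=10 are all placeholders for illustration, not values from the question.

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp
from sklearn.datasets import dump_svmlight_file, load_svmlight_file
from sklearn.decomposition import TruncatedSVD

# Made-up stand-in for the real data: a random sparse matrix
# written to a temporary file in libsvm (svmlight) format.
X_demo = sp.random(200, 1000, density=0.01, format="csr", random_state=0)
y_demo = np.zeros(200)
path = os.path.join(tempfile.mkdtemp(), "demo.svm")  # hypothetical file name
dump_svmlight_file(X_demo, y_demo, path)

# load_svmlight_file returns a scipy sparse CSR matrix and a label array.
X, y = load_svmlight_file(path, n_features=1000)

# TruncatedSVD accepts the sparse matrix directly and does not center it.
svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X)

print(X_reduced.shape)
```

With real data, point path at the actual file. Passing n_features when loading the test set helps ensure that the training and test matrices have the same number of columns, so that svd.transform can be applied to both.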
Edit (March 2019): There is ongoing work to implement PCA on sparse data with implicit centering: https://github.com/scikit-learn/scikit-learn/pull/12841
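The idea of implicit centering can be sketched without touching scikit-learn internals. The code below is not the implementation from that pull request. It is an illustrative assumption built on scipy.sparse.linalg.LinearOperator and svds, using made-up random data. Products with the centered matrix are computed as products with the sparse matrix minus a mean correction, so the dense centered matrix is never formed, except in the final check, which is only feasible because the toy matrix is small.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, svds

# Made-up sparse toy data.
X = sp.random(300, 50, density=0.05, format="csr", random_state=0)
n_samples, n_features = X.shape
mean = np.asarray(X.mean(axis=0)).ravel()  # per-feature means

def matvec(v):
    # (X - 1 * mean^T) @ v, computed without densifying X.
    v = np.asarray(v).ravel()
    return X @ v - np.full(n_samples, mean @ v)

def rmatvec(u):
    # (X - 1 * mean^T)^T @ u, computed without densifying X.
    u = np.asarray(u).ravel()
    return X.T @ u - mean * u.sum()

centered = LinearOperator(
    (n_samples, n_features), matvec=matvec, rmatvec=rmatvec, dtype=X.dtype
)

k = 5  # number of components (arbitrary for this sketch)
_, s_implicit, _ = svds(centered, k=k)
s_implicit = np.sort(s_implicit)[::-1]  # sort in descending order

# Check against explicit dense centering (fine here, the matrix is tiny).
s_dense = np.linalg.svd(X.toarray() - mean, compute_uv=False)[:k]
matches = np.allclose(s_implicit, s_dense)
print(matches)
```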