Performing PCA on large sparse matrix by using sklearn
Problem description
<预><代码>>>>clf.explained_variance_ratio_.sum()I am trying to apply PCA on huge sparse matrix, in the following link it says that randomizedPCA of sklearn can handle sparse matrix of scipy sparse format. Apply PCA on very large sparse matrix
However, I always get an error. Can someone point out what I am doing wrong?
The input matrix 'X_train' contains float64 values:
>>> type(X_train)
<class 'scipy.sparse.csr.csr_matrix'>
>>> X_train.shape
(2365436, 1617899)
>>> X_train.ndim
2
>>> X_train[0]
<1x1617899 sparse matrix of type '<type 'numpy.float64'>'
        with 81 stored elements in Compressed Sparse Row format>
I am trying to do:
>>> from sklearn.decomposition import RandomizedPCA
>>> pca = RandomizedPCA()
>>> pca.fit(X_train)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 567, in fit
self._fit(check_array(X))
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 334, in check_array
copy, force_all_finite)
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 239, in _ensure_sparse_format
raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
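For reference, this TypeError comes from the estimator's input validation, which rejects sparse input outright. By contrast, sklearn's TruncatedSVD does accept scipy sparse matrices directly; a minimal sketch on a small random CSR matrix (the shape and density here are made up for illustration):

```python
from scipy import sparse as sp
from sklearn.decomposition import TruncatedSVD

# A small random CSR matrix standing in for X_train (illustrative shape only)
X = sp.rand(1000, 500, density=0.01, format='csr')

# TruncatedSVD accepts scipy sparse input directly -- no toarray() needed
svd = TruncatedSVD(n_components=20)
X_reduced = svd.fit_transform(X)

print(X_reduced.shape)  # (1000, 20): one row per sample, one column per component
```

Note that TruncatedSVD does not center the data before decomposition (centering would destroy sparsity), which is precisely why it can operate on sparse input where exact PCA cannot.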
If I try to convert to a dense matrix, I run out of memory:
>>> pca.fit(X_train.toarray())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 949, in toarray
return self.tocoo(copy=False).toarray(order=order, out=out)
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/coo.py", line 274, in toarray
B = self._process_toarray_args(order, out)
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/base.py", line 800, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
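The MemoryError is expected: a dense float64 array with the shape above needs roughly rows × columns × 8 bytes. A quick back-of-the-envelope check:

```python
# Shape of X_train from the question
n_rows, n_cols = 2365436, 1617899

# 8 bytes per float64 element in a dense array
bytes_needed = n_rows * n_cols * 8

print(bytes_needed / 2**40)  # roughly 27.8 TiB -- far beyond typical RAM
```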
Answer

Due to the nature of PCA, even if the input is a sparse matrix, the output is not. You can check this with a quick example:
>>> import numpy as np
>>> from sklearn.decomposition import TruncatedSVD
>>> from scipy import sparse as sp
Create a random sparse matrix with 0.01% of its entries non-zero:
>>> X = sp.rand(1000, 1000, density=0.0001)
Apply truncated SVD to it:
>>> clf = TruncatedSVD(100)
>>> Xpca = clf.fit_transform(X)
Now, check the results:
>>> type(X)
scipy.sparse.coo.coo_matrix
>>> type(Xpca)
numpy.ndarray
>>> print np.count_nonzero(Xpca), Xpca.size
95000, 100000
which suggests that 95000 of the entries are non-zero; however,
>>> np.isclose(Xpca, 0, atol=1e-15).sum(), Xpca.size
99481, 100000
99481 elements are close to 0 (< 1e-15), but not exactly 0.
Which means, in short, that for PCA, even if the input is a sparse matrix, the output is not. Thus, if you try to extract 100,000,000 (1e8) components from your matrix, you will end up with a 1e8 x n_features (in your example, 1e8 x 1617899) dense matrix, which, of course, cannot be held in memory.
I'm not an expert statistician, but I believe there is currently no workaround for this using scikit-learn. It is not a problem of scikit-learn's implementation; it is simply that the mathematical definition of their sparse PCA (by means of sparse SVD) makes the result dense.
The only workaround that might work for you is to start with a small number of components and increase it until you reach a balance between the data you can keep in memory and the percentage of variance explained (which you can calculate as follows):
>>> clf.explained_variance_ratio_.sum()
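That search can be sketched as a simple loop: grow `n_components` until the cumulative explained variance is acceptable. The variance target and the step schedule below are arbitrary choices for illustration, and the random matrix stands in for the real `X_train`:

```python
from scipy import sparse as sp
from sklearn.decomposition import TruncatedSVD

# Stand-in for X_train: a small random sparse matrix
X = sp.rand(1000, 1000, density=0.01, format='csr')

target = 0.5  # arbitrary: fraction of total variance we want explained
for n_components in (10, 50, 100, 200):
    clf = TruncatedSVD(n_components)
    clf.fit(X)
    explained = clf.explained_variance_ratio_.sum()
    print(n_components, explained)
    if explained >= target:
        break  # enough variance captured; stop before memory becomes an issue
```

Each iteration refits from scratch, so in practice you would pick a coarse schedule; the point is only to find the smallest `n_components` whose dense output still fits in memory while explaining enough variance.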