Performing PCA on a large sparse matrix using sklearn
Problem description
I am trying to apply PCA to a huge sparse matrix. The following link ("Apply PCA on very large sparse matrix") says that sklearn's RandomizedPCA can handle sparse matrices in scipy sparse format.
However, I always get an error. Can someone point out what I am doing wrong?
The input matrix 'X_train' contains float64 numbers:
>>> type(X_train)
<class 'scipy.sparse.csr.csr_matrix'>
>>> X_train.shape
(2365436, 1617899)
>>> X_train.ndim
2
>>> X_train[0]
<1x1617899 sparse matrix of type '<type 'numpy.float64'>'
with 81 stored elements in Compressed Sparse Row format>
I am trying to do:
>>> from sklearn.decomposition import RandomizedPCA
>>> pca = RandomizedPCA()
>>> pca.fit(X_train)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 567, in fit
self._fit(check_array(X))
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 334, in check_array
copy, force_all_finite)
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 239, in _ensure_sparse_format
raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
If I try to convert it to a dense matrix, I think I run out of memory:
>>> pca.fit(X_train.toarray())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 949, in toarray
return self.tocoo(copy=False).toarray(order=order, out=out)
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/coo.py", line 274, in toarray
B = self._process_toarray_args(order, out)
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/base.py", line 800, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
Due to the nature of PCA, even if the input is a sparse matrix, the output is not. You can check this with a quick example:
>>> import numpy as np
>>> from sklearn.decomposition import TruncatedSVD
>>> from scipy import sparse as sp
Create a random sparse matrix with 0.01% of its data as non-zeros.
>>> X = sp.rand(1000, 1000, density=0.0001)
Apply a truncated SVD to it (TruncatedSVD accepts sparse input directly and, unlike PCA, does not center the data):
>>> clf = TruncatedSVD(100)
>>> Xpca = clf.fit_transform(X)
Now, check the results:
>>> type(X)
scipy.sparse.coo.coo_matrix
>>> type(Xpca)
numpy.ndarray
>>> print np.count_nonzero(Xpca), Xpca.size
95000, 100000
This suggests that 95000 of the entries are non-zero. However:
>>> np.isclose(Xpca, 0, atol=1e-15).sum(), Xpca.size
99481, 100000
99481 elements are close to 0 (<1e-15), but not exactly 0.
In short, this means that for PCA, even if the input is a sparse matrix, the output is not. Thus, if you try to extract 100,000,000 (1e8) components from your matrix, you will end up with a 1e8 x n_features (in your example, 1e8 x 1617899) dense matrix, which of course cannot be held in memory.
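To get a feel for the sizes involved, here is a rough back-of-the-envelope estimate. It is only a sketch: it assumes dense float64 storage (8 bytes per value), takes the shape from the question, and uses 100 components purely as an illustrative number. The helper name dense_gigabytes is made up for this example.

```python
# Back-of-the-envelope memory estimate for dense float64 arrays.
n_samples, n_features = 2365436, 1617899   # X_train.shape from the question
bytes_per_value = 8                        # float64


def dense_gigabytes(rows, cols):
    """Size in GB (1e9 bytes) of a dense float64 array of shape (rows, cols)."""
    return rows * cols * bytes_per_value / 1e9


# Hypothetical illustration values; 100 components is an arbitrary choice.
dense_input_gb = dense_gigabytes(n_samples, n_features)   # X_train.toarray()
components_100_gb = dense_gigabytes(100, n_features)      # components_ with 100 components
reduced_100_gb = dense_gigabytes(n_samples, 100)          # transformed data with 100 components

print(dense_input_gb, components_100_gb, reduced_100_gb)
```

The dense version of X_train comes out on the order of 30 TB, which is why toarray() fails, while the dense outputs for 100 components are each in the low single-digit GB range. Memory for the outputs grows linearly with the number of components.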
I'm not an expert statistician, but I believe there is currently no workaround for this in scikit-learn. It is not a problem with scikit-learn's implementation; it is simply the mathematical definition of its sparse PCA (by means of sparse SVD) that makes the result dense.
The only workaround that might work for you is to start with a small number of components and increase it until you reach a balance between the data you can keep in memory and the percentage of the variance explained (which you can calculate as follows):
>>> clf.explained_variance_ratio_.sum()
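The "start small and increase" idea could be sketched roughly as follows. This is a minimal illustration, not a tuned recipe: the random matrix is a small stand-in for X_train, and both the candidate component counts and the target threshold are hypothetical values you would choose for your own data and memory budget.

```python
import numpy as np
from scipy import sparse as sp
from sklearn.decomposition import TruncatedSVD

# Small random stand-in for X_train (assumption: the real data is a CSR matrix).
X = sp.random(2000, 500, density=0.01, format="csr", random_state=0)

target = 0.5  # hypothetical threshold for the summed explained variance ratio

for n_components in (10, 20, 50, 100, 200):
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    # The transformed data is a dense array of shape (n_samples, n_components).
    X_reduced = svd.fit_transform(X)
    explained = svd.explained_variance_ratio_.sum()
    print(n_components, round(float(explained), 3), X_reduced.nbytes)
    if explained >= target:
        break  # stop once enough variance is explained
```

Each iteration refits from scratch, so on a matrix the size of X_train it is probably worth trying only a handful of values and watching X_reduced.nbytes alongside the explained variance.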