Performing PCA on a large sparse matrix using sklearn


Question

I am trying to apply PCA to a huge sparse matrix. The following link says that sklearn's RandomizedPCA can handle sparse matrices in scipy sparse format: Apply PCA on very large sparse matrix

However, I always get an error. Can someone point out what I am doing wrong?

The input matrix 'X_train' contains float64 values:

>>> type(X_train)
<class 'scipy.sparse.csr.csr_matrix'>
>>> X_train.shape
(2365436, 1617899)
>>> X_train.ndim
2
>>> X_train[0]
<1x1617899 sparse matrix of type '<type 'numpy.float64'>'
    with 81 stored elements in Compressed Sparse Row format>

I am trying to do:

>>> from sklearn.decomposition import RandomizedPCA
>>> pca = RandomizedPCA()
>>> pca.fit(X_train)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 567, in fit
    self._fit(check_array(X))
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 334, in check_array
    copy, force_all_finite)
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 239, in _ensure_sparse_format
    raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

If I try to convert it to a dense matrix, I run out of memory:

>>> pca.fit(X_train.toarray())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 949, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/coo.py", line 274, in toarray
    B = self._process_toarray_args(order, out)
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/base.py", line 800, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError

Solution

Due to the nature of PCA, even if the input is a sparse matrix, the output is not. You can check this with a quick example:

>>> import numpy as np
>>> from sklearn.decomposition import TruncatedSVD
>>> from scipy import sparse as sp

Create a random sparse matrix with 0.01% of its entries non-zero.

>>> X = sp.rand(1000, 1000, density=0.0001)

Apply PCA to it:

>>> clf = TruncatedSVD(100)
>>> Xpca = clf.fit_transform(X)

Now, check the results:

>>> type(X)
scipy.sparse.coo.coo_matrix
>>> type(Xpca)
numpy.ndarray
>>> print np.count_nonzero(Xpca), Xpca.size
95000, 100000

which suggests that 95000 of the entries are non-zero, however,

>>> np.isclose(Xpca, 0, atol=1e-15).sum(), Xpca.size
99481, 100000

99481 elements are close to 0 (< 1e-15), but not exactly 0.
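
If all you want is to store such a result compactly, one could threshold those near-zeros away after the transform; a minimal sketch, reusing the 1e-15 cutoff from above (an arbitrary cutoff, for illustration only):

>>> Xpca_thresholded = np.where(np.isclose(Xpca, 0, atol=1e-15), 0, Xpca)   # zero out the near-zero entries
>>> Xpca_sparse = sp.csr_matrix(Xpca_thresholded)   # store the thresholded result sparsely; 1e-15 is not a principled choice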

In short, this means that for PCA, even if the input is a sparse matrix, the output is not. Thus, if you try to extract 100,000,000 (1e8) components from your matrix, you will end up with a 1e8 x n_features (in your example, 1e8 x 1617899) dense matrix, which, of course, cannot be held in memory.
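
A quick back-of-the-envelope calculation shows the scale (8 bytes per float64 value):

>>> 2365436 * 1617899 * 8 / 1e12   # your X_train as a dense array, in terabytes
30.616292311712
>>> 1e8 * 1617899 * 8 / 1e15   # a 1e8 x 1617899 dense matrix, in petabytes
1.2943192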

I'm not an expert statistician, but I believe there is currently no workaround for this in scikit-learn. It is not a problem with scikit-learn's implementation; it is simply the mathematical definition of its sparse PCA (by means of sparse SVD) that makes the result dense.

The only workaround that might work for you is to start with a small number of components and increase it until you reach a balance between the data you can keep in memory and the percentage of variance explained (which you can calculate as follows):

>>> clf.explained_variance_ratio_.sum()
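
A minimal sketch of that search, assuming a hypothetical variance target (the 0.80 target and the component counts below are arbitrary illustrative choices):

>>> target = 0.80   # hypothetical: fraction of variance you want explained
>>> for n in [10, 50, 100, 500]:   # arbitrary component counts to try
...     clf = TruncatedSVD(n_components=n)
...     Xpca = clf.fit_transform(X_train)   # TruncatedSVD accepts the sparse matrix directly
...     explained = clf.explained_variance_ratio_.sum()
...     print n, explained
...     if explained >= target:   # enough variance captured; stop growing n
...         break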
