Performing PCA on large sparse matrix by using sklearn


Problem Description


I am trying to apply PCA to a huge sparse matrix. The following link says that sklearn's RandomizedPCA can handle sparse matrices in scipy sparse format: Apply PCA on very large sparse matrix

However, I always get an error. Can someone point out what I am doing wrong?

The input matrix 'X_train' contains float64 values:

>>> type(X_train)
<class 'scipy.sparse.csr.csr_matrix'>
>>> X_train.shape
(2365436, 1617899)
>>> X_train.ndim
2
>>> X_train[0]
<1x1617899 sparse matrix of type '<type 'numpy.float64'>'
    with 81 stored elements in Compressed Sparse Row format>

I am trying to do:

>>> from sklearn.decomposition import RandomizedPCA
>>> pca = RandomizedPCA()
>>> pca.fit(X_train)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 567, in fit
    self._fit(check_array(X))
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 334, in check_array
    copy, force_all_finite)
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 239, in _ensure_sparse_format
    raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

If I try to convert it to a dense matrix, I run out of memory (a rough size estimate follows the traceback below).

>>> pca.fit(X_train.toarray())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 949, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/coo.py", line 274, in toarray
    B = self._process_toarray_args(order, out)
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/base.py", line 800, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
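A quick back-of-the-envelope estimate (a sketch using the shape reported above; the TiB conversion is purely illustrative) shows why this must fail: a dense float64 copy of X_train would need roughly 28 TiB of RAM.

rows, cols = 2365436, 1617899     # X_train.shape from above
bytes_needed = rows * cols * 8    # float64 stores 8 bytes per entry
print bytes_needed / 2.0 ** 40    # ~27.8 TiB for the dense array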

Solution

Due to the nature of PCA, even if the input is a sparse matrix, the output is not. You can check this with a quick example:

>>> from sklearn.decomposition import TruncatedSVD
>>> from scipy import sparse as sp
>>> import numpy as np        # needed for the checks below

Create a random sparse matrix with 0.01% of its data as non-zeros.

>>> X = sp.rand(1000, 1000, density=0.0001)

Apply PCA to it:

>>> clf = TruncatedSVD(100)
>>> Xpca = clf.fit_transform(X)

Now, check the results:

>>> type(X)
scipy.sparse.coo.coo_matrix
>>> type(Xpca)
numpy.ndarray
>>> print np.count_nonzero(Xpca), Xpca.size
95000, 100000

which suggests that 95000 of the entries are non-zero. However,

>>> np.isclose(Xpca, 0, atol=1e-15).sum(), Xpca.size
99481, 100000

99481 elements are close to 0 (<1e-15), but not 0.
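As a quick check (a sketch reusing the sp and Xpca objects from above), converting the result back to a sparse matrix recovers nothing, because all of those tiny values are stored as explicit non-zeros:

>>> sp.csr_matrix(Xpca).nnz   # every value above exact zero is kept
95000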

In short, this means that for PCA, even if the input is a sparse matrix, the output is not. Thus, if you try to extract 100,000,000 (1e8) components from your matrix, you will end up with a 1e8 x n_features (in your example, 1e8 x 1617899) dense matrix, which, of course, cannot be held in memory.

I'm not an expert statistician, but I believe there is currently no workaround for this in scikit-learn. It is not a problem of scikit-learn's implementation; it is just the mathematical definition of their sparse PCA (by means of sparse SVD) that makes the result dense.

The only workaround that might work for you is to start with a small number of components and increase it until you reach a balance between the data you can keep in memory and the percentage of variance explained (which you can calculate as follows):

>>> clf.explained_variance_ratio_.sum()
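A minimal sketch of that incremental search (assumptions: X_train is the sparse matrix from the question, and both the candidate component counts and the 0.95 variance target are arbitrary placeholders):

from sklearn.decomposition import TruncatedSVD

target = 0.95                          # arbitrary explained-variance goal
for n_components in (100, 500, 1000, 5000):
    clf = TruncatedSVD(n_components)   # accepts scipy sparse input directly
    clf.fit(X_train)
    explained = clf.explained_variance_ratio_.sum()
    print n_components, explained
    if explained >= target:            # stop once enough variance is explained
        break

Since TruncatedSVD never densifies the input, what bounds n_components here is the dense n_samples x n_components output (plus the n_components x n_features components_ array), not the sparse matrix itself.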
