Apply PCA on very large sparse matrix

Question

I am doing a text classification task in R, and I have a document-term matrix of size 22,490 by 120,000 (only 4 million non-zero entries, less than 1% of all entries). Now I want to reduce the dimensionality using PCA (Principal Component Analysis). Unfortunately, R cannot handle a matrix this large, so I stored the sparse matrix in a file in the Matrix Market format, hoping to use some other tool to do the PCA.

So could anyone give me some hints about useful libraries (in any programming language) that can do PCA on a matrix of this scale with ease? Alternatively, how could I do the PCA longhand myself, that is, first calculate the covariance matrix and then calculate its eigenvalues and eigenvectors?

What I want is to calculate all PCs (120,000) and keep only the top N PCs that account for 90% of the variance. Obviously, in this case I have to set a threshold a priori so that very small values in the covariance matrix are set to 0; otherwise the covariance matrix will not be sparse, and its size would be 120,000 by 120,000, which is impossible to handle on a single machine. The loadings (eigenvectors) will also be extremely large and should be stored in a sparse format.
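
To make the size concern concrete, here is a back-of-the-envelope sketch using only the numbers given in the question. It assumes 8-byte floats for a dense covariance matrix and a CSR layout with 32-bit indices for the sparse document-term matrix; both are illustrative assumptions, not something the question specifies. It also notes that a matrix with 22,490 rows has rank at most 22,490, so no more than that many components can carry non-zero variance.

```python
# Sizes taken from the question; the storage layouts are assumptions.
n_docs, n_terms = 22_490, 120_000
nnz = 4_000_000
GIB = 2**30

# Dense 120,000 x 120,000 covariance matrix of float64 values.
dense_cov_gib = n_terms * n_terms * 8 / GIB

# Sparse document-term matrix in CSR form:
# float64 data + int32 column indices + int32 row pointers.
csr_gib = (nnz * 8 + nnz * 4 + (n_docs + 1) * 4) / GIB

# Fraction of entries that are non-zero.
density = nnz / (n_docs * n_terms)

# The rank (and so the number of components with non-zero variance)
# is bounded by the smaller dimension.
max_rank = min(n_docs, n_terms)

print(f"dense covariance: {dense_cov_gib:.1f} GiB")
print(f"sparse input (CSR): {csr_gib:.3f} GiB")
print(f"density: {density:.4%}")
print(f"components with non-zero variance: at most {max_rank}")
```

The dense covariance matrix alone would need roughly 107 GiB, well beyond the 24 GB machine mentioned in the question, while the sparse input itself fits comfortably in memory.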

Thanks a lot for your help!

Note: I am using a machine with 24 GB of RAM and 8 CPU cores.

Recommended answer

The Python toolkit scikit-learn has a few PCA variants, of which RandomizedPCA can handle sparse matrices in any of the formats supported by scipy.sparse. scipy.io.mmread should be able to parse the Matrix Market format (I never tried it, though).
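
As a minimal sketch of the loading step, the snippet below writes a small random sparse matrix to a Matrix Market file and reads it back with scipy.io.mmread. The toy matrix and the file name `doc_term.mtx` are placeholders for illustration; with real data, only the `mmread` call and the conversion to CSR would be needed.

```python
import os
import tempfile

import scipy.sparse as sp
from scipy.io import mmread, mmwrite

# Toy stand-in for the real document-term matrix (hypothetical data).
toy = sp.random(50, 200, density=0.01, format="coo", random_state=0)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "doc_term.mtx")  # placeholder file name
    mmwrite(path, toy)                        # write Matrix Market format
    # mmread returns the sparse data in COO form; CSR is usually more
    # convenient for row slicing and matrix products.
    X = mmread(path).tocsr()

print(X.shape, X.nnz)
```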

Disclaimer: I'm on the scikit-learn development team.

EDIT: the sparse matrix support from RandomizedPCA has been deprecated in scikit-learn 0.14. TruncatedSVD should be used in its stead. See the documentation for details.
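
A minimal sketch of the TruncatedSVD route follows, assuming a toy random sparse matrix in place of the real data and an arbitrary choice of 50 components. Note that TruncatedSVD does not center the data before the decomposition, so the result is a truncated SVD (as in latent semantic analysis) rather than an exact PCA. The cumulative `explained_variance_ratio_` is used to check whether the computed components reach the 90% target from the question; if they do not, more components would have to be requested.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

# Toy stand-in for the 22,490 x 120,000 document-term matrix.
X = sp.random(500, 2000, density=0.005, format="csr", random_state=0)

# Compute only the leading components; n_components=50 is an
# illustrative value, not a recommendation.
svd = TruncatedSVD(n_components=50, algorithm="randomized", random_state=0)
X_reduced = svd.fit_transform(X)  # dense array of shape (500, 50)

# Cumulative share of variance explained by the first k components.
cum = np.cumsum(svd.explained_variance_ratio_)

target = 0.90
if cum[-1] >= target:
    # Smallest k whose cumulative share reaches the target.
    n_keep = int(np.searchsorted(cum, target) + 1)
else:
    # The computed components do not reach the target.
    n_keep = None

print(X_reduced.shape, float(cum[-1]), n_keep)
```

Because only the requested components are computed and the input stays sparse, this avoids forming the 120,000 by 120,000 covariance matrix altogether.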
