How can scikit-learn perform PCA on sparse data in libsvm format?


Question

I am using scikit-learn for a dimensionality reduction task. My training and test data are in libsvm format: a large sparse matrix with half a million columns.

I load the data with the load_svmlight_file function, but when I pass it to SparsePCA, scikit-learn raises an exception about the input data.

How can I solve this?

Recommended answer

Sparse PCA is an algorithm for finding a sparse decomposition (the components are subject to a sparsity constraint) of dense data.
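To illustrate the distinction, here is a minimal sketch of SparsePCA applied to a small dense array. The random data, the shapes, and the alpha value are assumptions chosen for illustration only; the point is that the input is dense and the sparsity constraint applies to the learned components.

```python
import numpy as np
from sklearn.decomposition import SparsePCA

# Hypothetical dense input: 60 samples, 12 features.
rng = np.random.RandomState(0)
X_dense = rng.rand(60, 12)

# alpha controls the strength of the sparsity penalty on the components.
spca = SparsePCA(n_components=3, alpha=1.0, random_state=0)
codes = spca.fit_transform(X_dense)

# Fraction of component entries that are exactly zero.
zero_fraction = np.mean(spca.components_ == 0)
print(codes.shape, spca.components_.shape, zero_fraction)
```

How many entries end up exactly zero depends on the data and on alpha, so the printed fraction is not meaningful beyond this toy input.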

If you want to do vanilla PCA on sparse data, you should use sklearn.decomposition.RandomizedPCA, which implements a scalable approximate method that works on both sparse and dense data.

IIRC sklearn.decomposition.PCA only works on dense data at the moment. Support for sparse data could be added in the future, for instance by delegating the SVD computation on the sparse data matrix to arpack.

Edit: as noted in the comments, sparse input for RandomizedPCA is deprecated. Instead, you should use sklearn.decomposition.TruncatedSVD, which does precisely what RandomizedPCA used to do on sparse data; that operation should not have been called PCA in the first place.

To clarify: PCA is mathematically defined as centering the data (subtracting the mean value from each feature) and then applying truncated SVD to the centered data.
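The definition above can be checked numerically. The following is a minimal sketch on a small random dense matrix (the data and the choice of k = 3 are assumptions for illustration): it centers the data, takes a truncated SVD with NumPy, and compares the result with sklearn.decomposition.PCA.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dense data: 50 samples, 8 features.
rng = np.random.RandomState(0)
X = rng.rand(50, 8)
k = 3

# Step 1: center each feature by subtracting its mean.
X_centered = X - X.mean(axis=0)

# Step 2: SVD of the centered data, keeping the top k components.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
scores_manual = U[:, :k] * S[:k]

# Reference: scikit-learn's PCA with the exact (full) SVD solver.
scores_sklearn = PCA(n_components=k, svd_solver="full").fit_transform(X)

# Each component is only defined up to sign, so compare absolute values.
same = np.allclose(np.abs(scores_manual), np.abs(scores_sklearn))
print(same)
```

The comparison uses absolute values because the sign of each singular vector is arbitrary and the two code paths may pick different conventions.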

Because centering the data would destroy the sparsity and force a dense representation that often no longer fits in memory, it is common to apply truncated SVD directly to the sparse data (without centering). This resembles PCA, but it is not exactly the same. It is implemented in scikit-learn as sklearn.decomposition.TruncatedSVD.
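As a rough sketch of this workflow, the snippet below writes a small synthetic sparse matrix to a temporary libsvm file, loads it back with load_svmlight_file, and reduces it with TruncatedSVD. The file name demo.svm, the matrix size, the density, and n_components=10 are all assumptions for illustration; a real dataset with half a million columns would be loaded from its own path in the same way.

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp
from sklearn.datasets import dump_svmlight_file, load_svmlight_file
from sklearn.decomposition import TruncatedSVD

# Hypothetical stand-in for the real data: 200 samples, 1000 sparse features.
X_demo = sp.random(200, 1000, density=0.01, format="csr", random_state=0)
y_demo = np.zeros(200)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.svm")  # hypothetical file name
    dump_svmlight_file(X_demo, y_demo, path)
    # n_features is passed explicitly so trailing empty columns are kept.
    X, y = load_svmlight_file(path, n_features=1000)

# X is a SciPy CSR matrix; TruncatedSVD accepts it without densifying.
svd = TruncatedSVD(n_components=10, algorithm="randomized", random_state=0)
X_reduced = svd.fit_transform(X)

# Centering, by contrast, fills in almost every entry of non-empty columns.
sparse_fraction = X.nnz / (X.shape[0] * X.shape[1])
centered = X.toarray() - np.asarray(X.mean(axis=0))
dense_fraction = np.count_nonzero(centered) / centered.size
print(X_reduced.shape, sparse_fraction, dense_fraction)
```

The last few lines densify the matrix only to show the effect of centering on this toy input; with the full-size data that step is exactly what would exhaust memory.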

Edit (March 2019): There is ongoing work to implement PCA on sparse data with implicit centering: https://github.com/scikit-learn/scikit-learn/pull/12841
