PCA in Spark MLlib and Spark ML


Question


Spark now has two machine learning libraries - Spark MLlib and Spark ML. They overlap somewhat in what they implement, but as I understand it (being new to the whole Spark ecosystem), Spark ML is the way to go and MLlib is still around mostly for backward compatibility.

My question is very concrete and related to PCA. In the MLlib implementation there seems to be a limitation on the number of columns:

spark.mllib supports PCA for tall-and-skinny matrices stored in row-oriented format and any Vectors.

Also, if you look at the Java code example, there is this note:

The number of columns should be small, e.g., less than 1000.
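For context, the RDD-based spark.mllib route that this note refers to goes through RowMatrix.computePrincipalComponents. Below is a minimal sketch of that call; the toy data, the number of components, and the variable names are made up, and sc is assumed to be an existing SparkContext:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Toy data: three rows, three columns (made-up values).
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 7.0),
  Vectors.dense(2.0, 5.0, 1.0),
  Vectors.dense(4.0, 3.0, 2.0)
))

val mat = new RowMatrix(rows)
// This is where the "columns should be small" caveat matters for wide data.
val pc = mat.computePrincipalComponents(2)
// Project the rows onto the top 2 principal components.
val projected = mat.multiply(pc)
```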

On the other hand, if you look at the ML documentation, no such limitation is mentioned.
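For reference, the DataFrame-based spark.ml API the question is about looks roughly like this. This is a sketch close to the documentation example; spark is assumed to be an existing SparkSession and the toy data is made up:

```scala
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors

// Toy data: a single "features" column of dense vectors (made-up values).
val data = Seq(
  Vectors.dense(1.0, 0.0, 7.0),
  Vectors.dense(2.0, 5.0, 1.0),
  Vectors.dense(4.0, 3.0, 2.0)
).map(Tuple1.apply)
val df = spark.createDataFrame(data).toDF("features")

// Fit a PCA model that keeps the top 2 principal components.
val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(2)
  .fit(df)

pca.transform(df).select("pcaFeatures").show(false)
```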

So, my question is - does this limitation also exist in Spark ML? And if so, why does it exist, and is there any workaround that would make this implementation usable even when the number of columns is large?

Solution

PCA consists of finding a set of decorrelated random variables with which you can represent your data, sorted in decreasing order by the amount of variance they retain.

These variables can be found by projecting your data points onto a specific orthogonal subspace. If your (mean-centered) data matrix is X, this subspace is spanned by the eigenvectors of X^T X.

When X is large, say of dimensions n x d, you can compute X^T X by computing the outer product of each row of the matrix with itself, then adding all the results up. This is of course amenable to a simple map-reduce procedure if d is small, no matter how large n is, because the outer product of each row with itself is a d x d matrix that each worker has to manipulate in main memory. That's why you might run into trouble when handling many columns.
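To illustrate that map-reduce computation, here is a minimal RDD sketch that accumulates X^T X as a sum of per-row outer products, using Breeze for the local linear algebra; the function name and data layout are made up for this example:

```scala
import breeze.linalg.{DenseMatrix, DenseVector}
import org.apache.spark.rdd.RDD

// Accumulate X^T X as the sum of the d x d outer products of each (mean-centered) row.
def gramMatrix(rows: RDD[Array[Double]]): DenseMatrix[Double] =
  rows
    .map { row =>
      val v = DenseVector(row)
      v * v.t               // d x d outer product, held in each worker's memory
    }
    .treeReduce(_ + _)      // sum the partial d x d matrices on the way back to the driver
```

This is roughly the kind of accumulation that spark.mllib's RowMatrix performs before running the decomposition locally on the driver, which is why the worker (and driver) memory bound shows up as a column-count limit.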

If the number of columns is large (and the number of rows not so much), you can indeed compute PCA another way: compute the SVD of your (mean-centered) transposed data matrix, then multiply it by the resulting eigenvectors and by the inverse of the diagonal matrix of singular values. There's your orthogonal subspace.
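To make this second route concrete, here is a minimal local sketch of one standard version of the "wide data" trick, working from the small n x n matrix X X^T instead of the d x d matrix X^T X. It uses Breeze, the function name is made up, and the exact formulation may differ slightly from what the answer describes:

```scala
import breeze.linalg.{DenseMatrix, eigSym}

// x is the mean-centered data matrix, n x d with d >> n.
def principalDirections(x: DenseMatrix[Double]): DenseMatrix[Double] = {
  val small = x * x.t                       // n x n instead of d x d
  val es = eigSym(small)                    // eigenvalues (ascending) and eigenvectors of X X^T
  val u = es.eigenvectors
  val invSigma = DenseMatrix.tabulate(u.cols, u.cols) { (i, j) =>
    // inverse of the diagonal matrix of singular values (square roots of the eigenvalues)
    if (i == j && es.eigenvalues(i) > 1e-12) 1.0 / math.sqrt(es.eigenvalues(i)) else 0.0
  }
  // If X X^T u = lambda * u, then v = X^T u / sqrt(lambda) is a unit eigenvector of X^T X.
  x.t * u * invSigma                        // columns are the d-dimensional principal directions
}
```

Note that this sketch does not sort or truncate the components; selecting the top k (largest eigenvalues) is left out for brevity.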

Bottom line: if the spark.ml implementation follows the first approach every time, then the limitation should be the same. If they check the dimensions of the input dataset to decide whether they should go for the second approach, then you won't have problems dealing with large numbers of columns if the number of rows is small.

Regardless of that, the limit is imposed by how much memory your workers have, so perhaps they let users hit the ceiling themselves rather than suggesting a limitation that may not apply to everyone. That might be why they decided not to mention the limitation in the new docs.

Update: the source code reveals that they do take the first approach every time, regardless of the dimensionality of the input. The actual limit is 65535 columns, and a warning is issued at 10,000.

