Pyspark and PCA: How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?

Problem Description

I am reducing the dimensionality of a Spark DataFrame with a PCA model in pyspark (using the spark ml library) as follows:

from pyspark.ml.feature import PCA

pca = PCA(k=3, inputCol="features", outputCol="pca_features")
model = pca.fit(data)

where data is a Spark DataFrame with one column labeled features, which is a DenseVector of 3 dimensions:

data.take(1)
Row(features=DenseVector([0.4536,-0.43218, 0.9876]), label=u'class1')

After fitting, I transform the data:

transformed = model.transform(data)
transformed.first()
Row(features=DenseVector([0.4536,-0.43218, 0.9876]), label=u'class1', pca_features=DenseVector([-0.33256, 0.8668, 0.625]))

How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?

Answer

[UPDATE: From Spark 2.2 onwards, PCA and SVD are both available in PySpark - see JIRA ticket SPARK-6227 and PCA & PCAModel for Spark ML 2.2; original answer below is still applicable for older Spark versions.]
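
For those newer versions, a minimal sketch of the built-in route could look like the following; it assumes a DataFrame df with a features vector column and uses the pc and explainedVariance members of the fitted PCAModel:

from pyspark.ml.feature import PCA

pca = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca.fit(df)

print(model.pc)                  # principal components (eigenvectors), as columns of a DenseMatrix
print(model.explainedVariance)   # fraction of variance explained by each of the k components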

Well, it seems incredible, but indeed there is no way to extract such information from a PCA decomposition (at least as of Spark 1.5). But again, there have been many similar "complaints" - see here, for example, for not being able to extract the best parameters from a CrossValidatorModel.

Fortunately, some months ago, I attended the 'Scalable Machine Learning' MOOC by AMPLab (Berkeley) & Databricks, i.e. the creators of Spark, where we implemented a full PCA pipeline 'by hand' as part of the homework assignments. I have modified my functions from back then (rest assured, I got full credit :-), so as to work with dataframes as inputs (instead of RDD's), of the same format as yours (i.e. Rows of DenseVectors containing the numerical features).

We first need to define an intermediate function, estimateCovariance, as follows:

import numpy as np

def estimateCovariance(df):
    """Compute the covariance matrix for a given dataframe.

    Note:
        The multi-dimensional covariance array should be calculated using outer products.  Don't
        forget to normalize the data by first subtracting the mean.

    Args:
        df:  A Spark dataframe with a column named 'features', which (column) consists of DenseVectors.

    Returns:
        np.ndarray: A multi-dimensional array where the number of rows and columns both equal the
            length of the arrays in the input dataframe.
    """
    m = df.select(df['features']).map(lambda x: x[0]).mean()
    dfZeroMean = df.select(df['features']).map(lambda x: x[0]).map(lambda x: x - m)  # subtract the mean

    return dfZeroMean.map(lambda x: np.outer(x, x)).sum() / df.count()
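
As a quick sanity check (an addition of mine, not part of the original homework code), the result can be compared against numpy's own covariance on the collected data; note that np.cov divides by n-1 by default, so bias=True is needed to match the division by df.count() above:

# Sanity check on a small DataFrame `df` with a 'features' column (collects to the driver)
local = np.array(df.select(df['features']).map(lambda x: x[0]).collect())
print(np.allclose(estimateCovariance(df), np.cov(local, rowvar=False, bias=True)))  # should print True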

Then, we can write a main pca function as follows:

from numpy.linalg import eigh

def pca(df, k=2):
    """Computes the top `k` principal components, corresponding scores, and all eigenvalues.

    Note:
        All eigenvalues should be returned in sorted order (largest to smallest). `eigh` returns
        each eigenvector as a column.  This function should also return eigenvectors as columns.

    Args:
        df: A Spark dataframe with a 'features' column, which (column) consists of DenseVectors.
        k (int): The number of principal components to return.

    Returns:
        tuple of (np.ndarray, RDD of np.ndarray, np.ndarray): A tuple of (eigenvectors, `RDD` of
        scores, eigenvalues).  Eigenvectors is a multi-dimensional array where the number of
        rows equals the length of the arrays in the input `RDD` and the number of columns equals
        `k`.  The `RDD` of scores has the same number of rows as `data` and consists of arrays
        of length `k`.  Eigenvalues is an array of length d (the number of features).
     """
    cov = estimateCovariance(df)
    col = cov.shape[1]
    eigVals, eigVecs = eigh(cov)
    inds = np.argsort(eigVals)
    eigVecs = eigVecs.T[inds[-1:-(col+1):-1]]  
    components = eigVecs[0:k]
    eigVals = eigVals[inds[-1:-(col+1):-1]]  # sort eigenvals
    score = df.select(df['features']).map(lambda x: x[0]).map(lambda x: np.dot(x, components.T) )
    # Return the `k` principal components, `k` scores, and all eigenvalues

    return components.T, score, eigVals
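
Since the covariance matrix is symmetric, eigh is the right solver here. As an optional cross-check (again an addition, not part of the original answer), the eigenvalues can be reproduced with plain numpy on the collected data, for small datasets:

# Local cross-check (small data only): recompute the covariance eigenvalues with plain numpy
local = np.array(df.select(df['features']).map(lambda x: x[0]).collect())
centered = local - local.mean(axis=0)
localEigVals, _ = eigh(np.dot(centered.T, centered) / local.shape[0])
print(sorted(localEigVals, reverse=True))  # should match the eigVals returned by pca(df)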

Test

Let's see first the results with the existing method, using the example data from the Spark ML PCA documentation (modifying them so as to be all DenseVectors):

 from pyspark.ml.feature import *
 from pyspark.mllib.linalg import Vectors
 data = [(Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]),),
         (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
         (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
 df = sqlContext.createDataFrame(data,["features"])
 pca_extracted = PCA(k=2, inputCol="features", outputCol="pca_features")
 model = pca_extracted.fit(df)
 model.transform(df).collect()

 [Row(features=DenseVector([0.0, 1.0, 0.0, 7.0, 0.0]), pca_features=DenseVector([1.6486, -4.0133])),
  Row(features=DenseVector([2.0, 0.0, 3.0, 4.0, 5.0]), pca_features=DenseVector([-4.6451, -1.1168])),
  Row(features=DenseVector([4.0, 0.0, 0.0, 6.0, 7.0]), pca_features=DenseVector([-6.4289, -5.338]))]

Then, with our method:

 comp, score, eigVals = pca(df)
 score.collect()

 [array([ 1.64857282,  4.0132827 ]),
  array([-4.64510433,  1.11679727]),
  array([-6.42888054,  5.33795143])]

Let me stress that we don't use any collect() methods in the functions we have defined - score is an RDD, as it should be.

Notice that the signs of our second column are all opposite from the ones derived by the existing method; but this is not an issue: according to the (freely downloadable) An Introduction to Statistical Learning, co-authored by Hastie & Tibshirani, p. 382

Each principal component loading vector is unique, up to a sign flip. This means that two different software packages will yield the same principal component loading vectors, although the signs of those loading vectors may differ. The signs may differ because each principal component loading vector specifies a direction in p-dimensional space: flipping the sign has no effect as the direction does not change. [...] Similarly, the score vectors are unique up to a sign flip, since the variance of Z is the same as the variance of −Z.
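
If you need the two outputs to agree exactly, e.g. for a regression test, one way to do it is to fix a sign convention yourself; the alignSigns helper below is hypothetical (not part of the original answer) and simply flips every component whose largest-magnitude entry is negative, applying the same flips to the scores:

def alignSigns(components):
    """Hypothetical helper: flip each component (column) so that its
    largest-magnitude entry is positive, fixing the sign convention."""
    idx = np.abs(components).argmax(axis=0)                    # row of the largest |entry| per column
    flips = np.sign(components[idx, np.arange(components.shape[1])])
    return components * flips, flips

alignedComp, flips = alignSigns(comp)
alignedScore = score.map(lambda s: s * flips)  # apply the same flips to each score vector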

Finally, now that we have the eigenvalues available, it is trivial to write a function for the percentage of the variance explained:

 def varianceExplained(df, k=1):
     """Calculate the fraction of variance explained by the top `k` eigenvectors.

     Args:
         df: A Spark dataframe with a 'features' column, which (column) consists of DenseVectors.
         k: The number of principal components to consider.

     Returns:
          float: A number between 0 and 1 representing the fraction of variance explained
             by the top `k` eigenvectors.
     """
     components, scores, eigenvalues = pca(df, k)  
     return sum(eigenvalues[0:k])/sum(eigenvalues)

 
 varianceExplained(df,1)
 # 0.79439325322305299

As a test, we also check if the variance explained in our example data is 1.0, for k=5 (since the original data are 5-dimensional):

 varianceExplained(df,5)
 # 1.0
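
If you want the per-component breakdown rather than a single cumulative number, a small addition (not in the original answer) is to normalize the eigenvalues returned by pca directly:

 comp, score, eigVals = pca(df)
 ratios = eigVals / sum(eigVals)   # explained-variance ratio of each individual component
 print(ratios)                     # the first entry should match varianceExplained(df, 1)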

[Developed & tested with Spark 1.5.0 & 1.5.1]
