Pyspark and PCA: How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?

Problem description

I am reducing the dimensionality of a Spark DataFrame with PCA model with pyspark (using the spark ml library) as follows:

pca = PCA(k=3, inputCol="features", outputCol="pca_features")
model = pca.fit(data)

where data is a Spark DataFrame with one column labeled features, which is a DenseVector of 3 dimensions:

data.take(1)
Row(features=DenseVector([0.4536,-0.43218, 0.9876]), label=u'class1')

After fitting, I transform the data:

transformed = model.transform(data)
transformed.first()
Row(features=DenseVector([0.4536,-0.43218, 0.9876]), label=u'class1', pca_features=DenseVector([-0.33256, 0.8668, 0.625]))

My question is: How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?

Recommended answer

Well, it seems incredible, but indeed there is not a way to extract such information from a PCA decomposition (at least as of Spark 1.5). But again, there have been many similar "complaints" - see here (http://stackoverflow.com/questions/31749593/how-to-extract-best-parameters-from-a-crossvalidatormodel), for example, for not being able to extract the best parameters from a CrossValidatorModel.

Fortunately, some months ago, I attended the 'Scalable Machine Learning' MOOC by AMPLab (Berkeley) & Databricks, i.e. the creators of Spark, where we implemented a full PCA pipeline 'by hand' as part of the homework assignments. I have modified my functions from back then (rest assured, I got full credit :-) so that they work with DataFrames as input (instead of RDDs), in the same format as yours (i.e. Rows of DenseVectors containing the numerical features).

We first need to define an intermediate function, estimateCovariance, as follows:

import numpy as np

def estimateCovariance(df):
    """Compute the covariance matrix for a given dataframe.

    Note:
        The multi-dimensional covariance array should be calculated using outer products.  Don't
        forget to normalize the data by first subtracting the mean.

    Args:
        df:  A Spark dataframe with a column named 'features', which (column) consists of DenseVectors.

    Returns:
        np.ndarray: A multi-dimensional array where the number of rows and columns both equal the
            length of the arrays in the input dataframe.
    """
    m = df.select(df['features']).map(lambda x: x[0]).mean()
    dfZeroMean = df.select(df['features']).map(lambda x: x[0]).map(lambda x: x - m)  # subtract the mean

    return dfZeroMean.map(lambda x: np.outer(x, x)).sum() / df.count()
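
As a quick sanity check (again my aside, not part of the original homework code), on a dataset small enough to collect, this should agree with numpy's population covariance; bias=True makes np.cov divide by the number of rows, just as we do above. For example, with the df defined in the Testing section below:

local = np.array(df.select(df['features']).map(lambda x: x[0].toArray()).collect())
print(np.allclose(estimateCovariance(df), np.cov(local, rowvar=False, bias=True)))  # expect True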

Then, we can write our main pca function, as follows:

from numpy.linalg import eigh

def pca(df, k=2):
    """Computes the top `k` principal components, corresponding scores, and all eigenvalues.

    Note:
        All eigenvalues should be returned in sorted order (largest to smallest). `eigh` returns
        the eigenvectors as columns.  This function should also return eigenvectors as columns.

    Args:
        df: A Spark dataframe with a 'features' column, which (column) consists of DenseVectors.
        k (int): The number of principal components to return.

    Returns:
        tuple of (np.ndarray, RDD of np.ndarray, np.ndarray): A tuple of (eigenvectors, `RDD` of
        scores, eigenvalues).  Eigenvectors is a multi-dimensional array where the number of
        rows equals the length of the arrays in the input `RDD` and the number of columns equals
        `k`.  The `RDD` of scores has the same number of rows as `data` and consists of arrays
        of length `k`.  Eigenvalues is an array of length d (the number of features).
     """
    cov = estimateCovariance(df)
    col = cov.shape[1]
    eigVals, eigVecs = eigh(cov)
    inds = np.argsort(eigVals)
    eigVecs = eigVecs.T[inds[-1:-(col+1):-1]]  # reorder eigenvectors (as rows), largest eigenvalue first
    components = eigVecs[0:k]                  # keep the top `k` loading vectors
    eigVals = eigVals[inds[-1:-(col+1):-1]]    # sort eigenvalues, largest to smallest
    score = df.select(df['features']).map(lambda x: x[0]).map(lambda x: np.dot(x, components.T))
    # Return the `k` principal components, `k` scores, and all eigenvalues

    return components.T, score, eigVals
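
For intuition, here is the same computation carried out locally with plain numpy on the example data used in the Testing section below; this is only a sketch for comparison, not part of the Spark pipeline:

X = np.array([[0.0, 1.0, 0.0, 7.0, 0.0],
              [2.0, 0.0, 3.0, 4.0, 5.0],
              [4.0, 0.0, 0.0, 6.0, 7.0]])  # the example rows from the Testing section
Xc = X - X.mean(axis=0)             # center the data
cov = Xc.T.dot(Xc) / X.shape[0]     # population covariance, as in estimateCovariance
eigVals, eigVecs = eigh(cov)        # eigh returns eigenvectors as columns
order = np.argsort(eigVals)[::-1]   # indices sorting eigenvalues in descending order
components = eigVecs[:, order[:2]]  # top-2 loading vectors, as columns
scores = X.dot(components)          # the same projection that pca() applies row by row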

Testing

Let's first see the results with the existing method, using the example data from the Spark ML PCA documentation (modified so that they are all DenseVectors):

from pyspark.ml.feature import PCA
from pyspark.mllib.linalg import Vectors

data = [(Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]),),
        (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
        (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = sqlContext.createDataFrame(data, ["features"])
pca_extracted = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca_extracted.fit(df)
model.transform(df).collect()

[Row(features=DenseVector([0.0, 1.0, 0.0, 7.0, 0.0]), pca_features=DenseVector([1.6486, -4.0133])),
 Row(features=DenseVector([2.0, 0.0, 3.0, 4.0, 5.0]), pca_features=DenseVector([-4.6451, -1.1168])),
 Row(features=DenseVector([4.0, 0.0, 0.0, 6.0, 7.0]), pca_features=DenseVector([-6.4289, -5.338]))]

Then, with our method:

comp, score, eigVals = pca(df)
score.collect()

[array([ 1.64857282,  4.0132827 ]),
 array([-4.64510433,  1.11679727]),
 array([-6.42888054,  5.33795143])]

Let me stress that we don't use any collect() methods in the functions we have defined - score is an RDD, as it should be.

Notice that the signs of our second column are all opposite to the ones derived by the existing method; but this is not an issue: according to the (freely downloadable) An Introduction to Statistical Learning, co-authored by Hastie & Tibshirani, p. 382:

Each principal component loading vector is unique, up to a sign flip. This means that two different software packages will yield the same principal component loading vectors, although the signs of those loading vectors may differ. The signs may differ because each principal component loading vector specifies a direction in p-dimensional space: flipping the sign has no effect as the direction does not change. [...] Similarly, the score vectors are unique up to a sign flip, since the variance of Z is the same as the variance of −Z.
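
We can also verify this equivalence-up-to-sign numerically (an aside of mine; a collect() is fine here, since we are merely inspecting the results):

ours = np.array(score.collect())
spark = np.array([row.pca_features.toArray() for row in model.transform(df).collect()])
print(np.allclose(np.abs(ours), np.abs(spark)))  # expect True: identical up to sign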

Finally, now that we have the eigenvalues available, it is trivial to write a function for the fraction of variance explained:

def varianceExplained(df, k=1):
    """Calculate the fraction of variance explained by the top `k` eigenvectors.

    Args:
        df: A Spark dataframe with a 'features' column, which (column) consists of DenseVectors.
        k: The number of principal components to consider.

    Returns:
        float: A number between 0 and 1 representing the fraction of variance explained
            by the top `k` eigenvectors.
    """
    components, scores, eigenvalues = pca(df, k)
    return sum(eigenvalues[:k]) / sum(eigenvalues)


varianceExplained(df, 1)
# 0.79439325322305299

As a test, we also check if the variance explained in our example data is 1.0, for k=5 (since the original data are 5-dimensional):

varianceExplained(df, 5)
# 1.0
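
One small efficiency note: varianceExplained as written re-runs the whole decomposition on every call. Since pca already returns all the eigenvalues sorted largest-to-smallest, the ratios for every k can be obtained from a single call, e.g.:

comp, score, eigVals = pca(df, k=2)
ratios = eigVals / eigVals.sum()  # per-component explained-variance ratios
print(np.cumsum(ratios))          # entry [k-1] is the variance explained by the top k components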

This should do your job efficiently; feel free to ask for any clarifications you may need.

[Developed & tested with Spark 1.5.0 & 1.5.1]
