pyspark: sparse vectors to scipy sparse matrix


Problem description

I have a Spark DataFrame with a column of short sentences and a column with a categorical variable. I'd like to perform tf-idf on the sentences and one-hot encoding on the categorical variable, then output the result to a sparse matrix on my driver once it is much smaller in size (for a scikit-learn model).
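
For reference, here is a minimal sketch of how such a combined features column might be produced with pyspark.ml (assuming Spark 3.x; input_df and the column names sentence and category are hypothetical):

from pyspark.ml import Pipeline
from pyspark.ml.feature import (Tokenizer, HashingTF, IDF,
                                StringIndexer, OneHotEncoder, VectorAssembler)

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="sentence", outputCol="words"),
    HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 16),
    IDF(inputCol="tf", outputCol="tf_idf"),
    StringIndexer(inputCol="category", outputCol="category_idx"),
    # OneHotEncoder is an Estimator in Spark 3.x (Spark 2.3/2.4 use OneHotEncoderEstimator)
    OneHotEncoder(inputCols=["category_idx"], outputCols=["category_ohe"]),
    # concatenate the tf-idf and one-hot vectors into a single sparse "features" column
    VectorAssembler(inputCols=["tf_idf", "category_ohe"], outputCol="features"),
])

features_df = pipeline.fit(input_df).transform(input_df).select("features")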

What is the best way to get the data out of Spark in sparse form? It seems like there is only a toArray() method on sparse vectors, which outputs NumPy arrays. However, the docs do say that scipy sparse arrays can be used in place of Spark sparse vectors.
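
For instance, the MLlib data-types documentation shows a single-column scipy.sparse matrix being accepted wherever a sparse vector is expected (this snippet uses the RDD-based pyspark.mllib API, purely as an illustration):

import numpy as np
import scipy.sparse as sps
from pyspark.mllib.regression import LabeledPoint

# a 3 x 1 SciPy csc_matrix stands in for a Spark sparse vector
sm = sps.csc_matrix((np.array([1.0, 3.0]), np.array([0, 2]), np.array([0, 2])), shape=(3, 1))
point = LabeledPoint(1.0, sm)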

Keep in mind also that the tf_idf values are in fact a column of sparse vectors. Ideally, it would be nice to get all of these features into one large sparse matrix.

Answer

One possible solution can be expressed as follows:

  • convert features to an RDD and extract the vectors:

from pyspark.ml.linalg import SparseVector
from operator import attrgetter

df = sc.parallelize([
    (SparseVector(3, [0, 2], [1.0, 3.0]), ),
    (SparseVector(3, [1], [4.0]), )
]).toDF(["features"])

features = df.rdd.map(attrgetter("features"))

  • add row indices:

    indexed_features = features.zipWithIndex()
    

  • flatten to an RDD of (i, j, value) tuples:

    def explode(row):
        vec, i = row
        for j, v in zip(vec.indices, vec.values):
            yield i, j, v
    
    entries = indexed_features.flatMap(explode)
    

  • collect and reshape:

    row_indices, col_indices, data = zip(*entries.collect())
    

  • compute shape:

    shape = (
        df.count(),
        df.rdd.map(attrgetter("features")).first().size
    )
    

  • create sparse matrix:

    from scipy.sparse import csr_matrix
    
    mat = csr_matrix((data, (row_indices, col_indices)), shape=shape)
    

  • quick sanity check:

    mat.todense()
    

    Expected result:

    matrix([[ 1.,  0.,  3.],
            [ 0.,  4.,  0.]])
    

  • Another approach:

    • convert each row of features to a matrix:

    import numpy as np
    
    def as_matrix(vec):
        # build a 1 x vec.size CSR matrix from a single SparseVector
        # using the (data, indices, indptr) constructor
        data, indices = vec.values, vec.indices
        shape = 1, vec.size
        return csr_matrix((data, indices, np.array([0, vec.values.size])), shape)
    
    mats = features.map(as_matrix)
    

  • and reduce with vstack:

    from scipy.sparse import vstack
    
    mat = mats.reduce(lambda x, y: vstack([x, y]))
    

    or collect and vstack:

    mat = vstack(mats.collect())
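
Whichever route is taken, the resulting csr_matrix can be fed straight into scikit-learn on the driver, for example (the label list y is hypothetical):

from sklearn.linear_model import LogisticRegression

y = [0, 1]  # hypothetical labels, one per row of mat
clf = LogisticRegression().fit(mat, y)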
    
