Create sparse RDD from scipy sparse matrix


Question

I have a large sparse matrix from scipy (300k x 100k, all binary values, mostly zeros). I would like to turn the rows of this matrix into an RDD and then do some computations on those rows - evaluate a function on each row, evaluate functions on pairs of rows, etc.

The key thing is that it's quite sparse and I don't want to explode the cluster - can I convert the rows to SparseVectors? Or perhaps convert the whole thing to a SparseMatrix?

Can you give an example where you read in a sparse array, set up the rows as an RDD, and compute something from the cartesian product of those rows?

Answer

I had this issue recently - I think you can convert directly by constructing the SparseMatrix from the scipy csc_matrix attributes. (Borrowing from Yang Bryan.)

import numpy as np
import scipy.sparse as sps
from pyspark.mllib.linalg import Matrices

# create a small scipy sparse matrix in CSC format
row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6])
sv = sps.csc_matrix((data, (row, col)), shape=(3, 3))

# convert to a pyspark SparseMatrix: Matrices.sparse takes
# (numRows, numCols, colPtrs, rowIndices, values), which map directly
# onto the csc_matrix attributes shape, indptr, indices and data
sparse_matrix = Matrices.sparse(sv.shape[0], sv.shape[1], sv.indptr, sv.indices, sv.data)
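The conversion above gives a local SparseMatrix; the question also asked for the rows as an RDD and a computation over their cartesian product. Below is a minimal sketch of the row-wise part, assuming a running pyspark session for the cluster side (the `sc.parallelize`, `Vectors.sparse`, and `rdd.cartesian` calls are shown only in comments); the row extraction and the local pairwise dot products run with scipy alone:

```python
import numpy as np
import scipy.sparse as sps

# Build the same 3x3 example matrix, but in CSR form so row slices are cheap.
row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6])
m = sps.csr_matrix((data, (row, col)), shape=(3, 3))

# Each row becomes (row_index, nonzero_column_indices, nonzero_values) --
# exactly the (indices, values) pair Vectors.sparse(size, indices, values)
# expects, so only the nonzeros ever leave the driver.
rows = [
    (i, m.indices[m.indptr[i]:m.indptr[i + 1]], m.data[m.indptr[i]:m.indptr[i + 1]])
    for i in range(m.shape[0])
]

# On a cluster this would be something like:
#   from pyspark.mllib.linalg import Vectors
#   rdd = sc.parallelize([(i, Vectors.sparse(m.shape[1], idx, vals))
#                         for i, idx, vals in rows])
#   pairs = rdd.cartesian(rdd)
#   dots = pairs.map(lambda p: ((p[0][0], p[1][0]), p[0][1].dot(p[1][1])))

# Locally, the same cartesian-product computation (pairwise dot products):
def sparse_dot(idx_a, val_a, idx_b, val_b):
    da = dict(zip(idx_a, val_a))
    return sum(v * da.get(j, 0) for j, v in zip(idx_b, val_b))

dots = {(i, j): sparse_dot(ia, va, jb, vb)
        for i, ia, va in rows for j, jb, vb in rows}
print(dots[(0, 0)])  # row 0 is [1, 0, 4], so its dot product with itself is 17
```

The dot product here is just a stand-in: any function of one row or of a pair of rows can be mapped over `rdd` or `pairs` in the same way. Note that `cartesian` on 300k rows produces ~9e10 pairs, so in practice you would filter or block the pairs first.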

