PySpark RDD sparse matrix multiplication from Scala to Python


Question

I previously posted a question on coordinate matrix multiplication with 9 million rows and 85,000 columns: Errors for block matrix multiplification in Spark

However, I ran into an out-of-memory issue on DataProc. I have tried configuring the cluster with high-memory cores, but with no luck.

I am reading this article and thought it may help in my case: https://www.balabit.com/blog/scalable-sparse-matrix-multiplication-in-apache-spark/ However, the solution they provide is in Scala, which I am not familiar with. Could someone be kind enough to translate this code to Python? Thanks a bunch!

def coordinateMatrixMultiply(leftMatrix: CoordinateMatrix,
                             rightMatrix: CoordinateMatrix): CoordinateMatrix = {
  // Key the left matrix by its column index j and the right matrix by its row index j
  val M_ = leftMatrix.entries.map({ case MatrixEntry(i, j, v) => (j, (i, v)) })
  val N_ = rightMatrix.entries.map({ case MatrixEntry(j, k, w) => (j, (k, w)) })

  // Join on the shared dimension j, multiply the paired values,
  // and sum the partial products for each output cell (i, k)
  val productEntries = M_
    .join(N_)
    .map({ case (_, ((i, v), (k, w))) => ((i, k), v * w) })
    .reduceByKey(_ + _)
    .map({ case ((i, k), sum) => MatrixEntry(i, k, sum) })

  new CoordinateMatrix(productEntries)
}

Answer

Implementation for Python 3.x

  1. Since Python 3 has no tuple unpacking in lambda parameters, we have to reference each MatrixEntry through a single variable e.
  2. Also, MatrixEntry is not indexable, so we must access its individual properties i, j and value.
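To illustrate point 1, here is a minimal, Spark-free sketch of the difference: tuple parameters in lambdas were removed by PEP 3113, so the Python 2 style no longer even compiles, and the single-argument, indexed style must be used instead.

```python
# Python 2 allowed "lambda (j, (i, v)): ...", but PEP 3113 removed tuple
# parameters, so compiling that source fails with a SyntaxError in Python 3.
try:
    compile("lambda (j, (i, v)): (j, i * v)", "<py2-style>", "eval")
    tuple_params_supported = True
except SyntaxError:
    tuple_params_supported = False

# The Python 3 workaround: take one argument and index into it.
pair_product = lambda e: (e[0], e[1][0] * e[1][1])
```

Under Python 3, `tuple_params_supported` ends up False, and `pair_product` does with indexing what the Python 2 lambda did with unpacking.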

def coordinateMatrixMultiply(leftmatrix, rightmatrix):
    # Key the left matrix by its column index and the right matrix by its row index
    left  =  leftmatrix.entries.map(lambda e: (e.j, (e.i, e.value)))
    right = rightmatrix.entries.map(lambda e: (e.i, (e.j, e.value)))
    # Join on the shared dimension, multiply the paired values, sum the
    # partial products per output cell, then flatten to (i, k, sum) tuples
    productEntries = left \
        .join(right) \
        .map(lambda e: ((e[1][0][0], e[1][1][0]), e[1][0][1] * e[1][1][1])) \
        .reduceByKey(lambda x, y: x + y) \
        .map(lambda e: (*e[0], e[1]))
    return productEntries
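To sanity-check the join-and-reduce logic without standing up a Spark cluster, the same pipeline can be mirrored in plain Python over (i, j, value) triples. This is a sketch for verification only, not the author's code; the function name and the dict-based "join" are my own, and a real run would use the RDD version above.

```python
from collections import defaultdict

def coordinate_multiply_plain(left_entries, right_entries):
    """Plain-Python mirror of the RDD pipeline; entries are (i, j, value) triples."""
    # "map": key left entries by column j, right entries by row j
    left = [(j, (i, v)) for (i, j, v) in left_entries]
    right_by_key = defaultdict(list)
    for (j, k, w) in right_entries:
        right_by_key[j].append((k, w))
    # "join" on the shared dimension j, then "reduceByKey" by summing
    # the partial products for each output cell (i, k)
    sums = defaultdict(float)
    for j, (i, v) in left:
        for (k, w) in right_by_key[j]:
            sums[(i, k)] += v * w
    # flatten back to (i, k, sum) triples, as the final map() does
    return [(i, k, s) for (i, k), s in sums.items()]

# Sparse 2x2 check: [[1, 2], [3, 0]] @ [[0, 1], [4, 5]]; zeros are omitted.
A = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0)]
B = [(0, 1, 1.0), (1, 0, 4.0), (1, 1, 5.0)]
result = {(i, k): s for (i, k, s) in coordinate_multiply_plain(A, B)}
```

The result contains 8.0 at (0, 0), 11.0 at (0, 1) and 3.0 at (1, 1); cell (1, 0) is absent because its product is zero, which also shows the output stays sparse. The RDD output can be wrapped back into a CoordinateMatrix via pyspark.mllib.linalg.distributed.CoordinateMatrix if the Scala behavior is needed.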
