PySpark RDD Sparse Matrix multiplication from Scala to Python
Question
I previously posted a question on coordinate matrix multiplication with 9 million rows and 85K columns: Errors for block matrix multiplication in Spark
However, I ran into an Out of Memory issue on DataProc. I have tried to configure the cluster with high-memory cores, but with no luck.
I was reading this article and thought it might help in my case: https://www.balabit.com/blog/scalable-sparse-matrix-multiplication-in-apache-spark/ However, the solution they provide is in Scala, which I am not familiar with. Could someone kindly translate this code to Python? Thanks a bunch!
def coordinateMatrixMultiply(leftMatrix: CoordinateMatrix, rightMatrix: CoordinateMatrix): CoordinateMatrix = {
  val M_ = leftMatrix.entries.map({ case MatrixEntry(i, j, v) => (j, (i, v)) })
  val N_ = rightMatrix.entries.map({ case MatrixEntry(j, k, w) => (j, (k, w)) })

  val productEntries = M_
    .join(N_)
    .map({ case (_, ((i, v), (k, w))) => ((i, k), (v * w)) })
    .reduceByKey(_ + _)
    .map({ case ((i, k), sum) => MatrixEntry(i, k, sum) })

  new CoordinateMatrix(productEntries)
}
Answer
Python 3.x implementation

- Since in Python 3 there is no tuple unpacking in lambda functions, we have to reference the MatrixEntry by a single variable e.
- Also, MatrixEntry is not indexable, so we must access its individual properties i, j and value.
def coordinateMatrixMultiply(leftmatrix, rightmatrix):
    # Key left entries by their column index j and right entries by their
    # row index i, so the join pairs up the terms of each dot product.
    left = leftmatrix.entries.map(lambda e: (e.j, (e.i, e.value)))
    right = rightmatrix.entries.map(lambda e: (e.i, (e.j, e.value)))
    productEntries = left \
        .join(right) \
        .map(lambda e: ((e[1][0][0], e[1][1][0]), (e[1][0][1] * e[1][1][1]))) \
        .reduceByKey(lambda x, y: x + y) \
        .map(lambda e: (*e[0], e[1]))  # flatten to (i, k, value) tuples
    # Note: this returns an RDD of (i, k, value) tuples, not a CoordinateMatrix.
    return productEntries
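To sanity-check the join / reduceByKey logic without a Spark cluster, the same pipeline can be sketched in plain Python over lists of (row, col, value) triples. This is only an illustration of the algorithm, not the PySpark code itself; matrix_a and matrix_b below are made-up toy inputs:

```python
from collections import defaultdict

def sparse_multiply(left, right):
    """Multiply two sparse matrices given as lists of (row, col, value) triples."""
    # "join" step: group right-matrix entries by their row index j
    right_by_row = defaultdict(list)
    for j, k, w in right:
        right_by_row[j].append((k, w))
    # "map" + "reduceByKey" steps: accumulate partial products per (i, k) cell
    sums = defaultdict(float)
    for i, j, v in left:
        for k, w in right_by_row[j]:
            sums[(i, k)] += v * w
    # final "map": flatten to (i, k, value) triples
    return [(i, k, s) for (i, k), s in sorted(sums.items())]

# Toy inputs (made-up): A = [[1, 2], [0, 3]], B = [[4, 0], [5, 6]]
matrix_a = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
matrix_b = [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)]
```

Zero cells are simply absent from the triples, which is exactly how CoordinateMatrix stores entries; only the nonzero partial products are ever computed.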