在 PySpark 中将两个 numpy 矩阵相乘 [英] Multiply two numpy matrices in PySpark

查看：99 发布时间：2021/6/25 18:35:32 python numpy apache-spark pyspark

本文介绍了在 PySpark 中将两个 numpy 矩阵相乘的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

假设我有这两个 Numpy 数组:

Let's say I have these two Numpy arrays:

A = np.arange(1024 ** 2, dtype=np.float64).reshape(1024, 1024)
B = np.arange(1024 ** 2, dtype=np.float64).reshape(1024, 1024)

我对它们执行以下操作:

and I perform the following on them:

np.sum(np.dot(A, B))

现在，我希望能够使用 PySpark 对相同的矩阵执行相同的计算，以便通过我的 Spark 集群实现分布式计算.

Now, I'd like to be able to essentially perform the same calculation with the same matrices using PySpark in order to achieve a distributed calculation with my Spark cluster.

有没有人知道或有在 PySpark 中按照这些方式执行某些操作的示例?

Does anyone know or have a sample that does something along these lines in PySpark?

非常感谢您的帮助！

推荐答案

使用此 post，您可以执行以下操作(但请参阅@kennytm 的评论，为什么此方法对于较大的矩阵可能会很慢):

Using the as_block_matrix method from this post, you could do the following (but see the comment of @kennytm why this method can be slow for bigger matrices):

import numpy as np
from pyspark.mllib.linalg.distributed import RowMatrix
A = np.arange(1024 ** 2, dtype=np.float64).reshape(1024, 1024)
B = np.arange(1024 ** 2, dtype=np.float64).reshape(1024, 1024)

from pyspark.mllib.linalg.distributed import *

def as_block_matrix(rdd, rowsPerBlock=1024, colsPerBlock=1024):
    return IndexedRowMatrix(
        rdd.zipWithIndex().map(lambda xi: IndexedRow(xi[1], xi[0]))
    ).toBlockMatrix(rowsPerBlock, colsPerBlock)

matrixA = as_block_matrix(sc.parallelize(A))
matrixB = as_block_matrix(sc.parallelize(B))
product = matrixA.multiply(matrixB)

这篇关于在 PySpark 中将两个 numpy 矩阵相乘的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

在 PySpark 中将两个 numpy 矩阵相乘 [英] Multiply two numpy matrices in PySpark

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

在 PySpark 中将两个 numpy 矩阵相乘 [英] Multiply two numpy matrices in PySpark

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭