Matrix Multiplication in Apache Spark

Question

I am trying to do matrix multiplication using Apache Spark and Java.

I have 2 main questions:

  1. How do I create an RDD that can represent a matrix in Apache Spark?
  2. How do I multiply two such RDDs?

Solution

It all depends on the input data and dimensions, but generally speaking what you want is not an RDD but one of the distributed data structures from org.apache.spark.mllib.linalg.distributed. At the moment it provides four different implementations of DistributedMatrix:

  • IndexedRowMatrix - can be created directly from an RDD[IndexedRow], where an IndexedRow consists of a row index and an org.apache.spark.mllib.linalg.Vector

    import org.apache.spark.mllib.linalg.{Vectors, Matrices}
    import org.apache.spark.mllib.linalg.distributed.{IndexedRowMatrix,
      IndexedRow}
    
    // Rows of the 3 x 3 identity matrix, each paired with its own row index
    val rows =  sc.parallelize(Seq(
      (0L, Array(1.0, 0.0, 0.0)),
      (1L, Array(0.0, 1.0, 0.0)),
      (2L, Array(0.0, 0.0, 1.0)))
    ).map{case (i, xs) => IndexedRow(i, Vectors.dense(xs))}
    
    val indexedRowMatrix = new IndexedRowMatrix(rows)
    

  • RowMatrix - similar to IndexedRowMatrix but without meaningful row indices. Can be created directly from an RDD[org.apache.spark.mllib.linalg.Vector]

    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    
    val rowMatrix = new RowMatrix(rows.map(_.vector))      
    

  • BlockMatrix - can be created from an RDD[((Int, Int), Matrix)], where the first element of each tuple contains the coordinates of the block and the second one is a local org.apache.spark.mllib.linalg.Matrix

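    import org.apache.spark.mllib.linalg.distributed.BlockMatrix
    
    // 3 x 3 identity block in compressed sparse column (CSC) form
    val eye = Matrices.sparse(
      3, 3, Array(0, 1, 2, 3), Array(0, 1, 2), Array(1.0, 1.0, 1.0))
    
    // Three identity blocks placed on the diagonal
    val blocks = sc.parallelize(Seq(
       ((0, 0), eye), ((1, 1), eye), ((2, 2), eye)))
    
    // rowsPerBlock = 3, colsPerBlock = 3, overall size 9 x 9
    val blockMatrix = new BlockMatrix(blocks, 3, 3, 9, 9)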
    

  • CoordinateMatrix - can be created from an RDD[MatrixEntry], where a MatrixEntry consists of a row index, a column index and a value.

    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix,
      MatrixEntry}
    
    val entries = sc.parallelize(Seq(
       (0, 0, 3.0), (2, 0, -5.0), (3, 2, 1.0),
       (4, 1, 6.0), (6, 2, 2.0), (8, 1, 4.0))
    ).map{case (i, j, v) => MatrixEntry(i, j, v)}
    
    val coordinateMatrix = new CoordinateMatrix(entries, 9, 3)
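
The four types above can be converted into one another. The following is a minimal sketch of those conversions, not part of the original answer. It assumes a local master and uses made-up names and data (`coordinate`, `indexed`, `rowOnly`, `blocked`); in spark-shell, `SparkContext.getOrCreate` simply returns the existing context.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// Assumption: reuse the running context (spark-shell) or start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("matrix-conversions").setMaster("local[2]"))

// A 3 x 3 diagonal matrix stored as (row, column, value) entries.
val entries = sc.parallelize(Seq(
  MatrixEntry(0, 0, 1.0), MatrixEntry(1, 1, 2.0), MatrixEntry(2, 2, 3.0)))
val coordinate = new CoordinateMatrix(entries, 3, 3)

val indexed = coordinate.toIndexedRowMatrix()  // rows keep their indices
val rowOnly = coordinate.toRowMatrix()         // row indices are dropped
val blocked = coordinate.toBlockMatrix(2, 2)   // 2 x 2 blocks, smaller at the edges

println((indexed.numRows(), indexed.numCols()))  // (3,3)
println((blocked.numRows(), blocked.numCols()))  // (3,3)
```

Starting from a CoordinateMatrix is convenient when the input arrives as individual entries; which target type to convert to depends on the operation you need afterwards.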
    

The first two implementations (IndexedRowMatrix and RowMatrix) support multiplication by a local Matrix:

    // 3 x 2 local matrix; the values are given in column-major order
    val localMatrix = Matrices.dense(3, 2, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))

    indexedRowMatrix.multiply(localMatrix).rows.collect
    // The resulting rows hold the vectors [1.0,4.0], [2.0,5.0] and [3.0,6.0]

The third one, BlockMatrix, can be multiplied by another BlockMatrix as long as the number of columns per block in this matrix matches the number of rows per block of the other matrix. CoordinateMatrix doesn't support multiplication but is pretty easy to create and convert to other types of distributed matrices:

    blockMatrix.multiply(coordinateMatrix.toBlockMatrix(3, 3))
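
For the second question (multiplying two distributed matrices), one possible route is to convert both operands to BlockMatrix with matching block sizes. The sketch below is an illustration, not the answer's own code. It assumes a local master, and the matrices `a` and `b` are made-up data.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// Assumption: reuse the running context (spark-shell) or start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("block-multiply").setMaster("local[2]"))

// a is 2 x 3, b is 3 x 2.
val a = new IndexedRowMatrix(sc.parallelize(Seq(
  IndexedRow(0L, Vectors.dense(1.0, 2.0, 3.0)),
  IndexedRow(1L, Vectors.dense(4.0, 5.0, 6.0)))))
val b = new IndexedRowMatrix(sc.parallelize(Seq(
  IndexedRow(0L, Vectors.dense(1.0, 0.0)),
  IndexedRow(1L, Vectors.dense(0.0, 1.0)),
  IndexedRow(2L, Vectors.dense(1.0, 1.0)))))

// colsPerBlock of the left matrix (2) equals rowsPerBlock of the right matrix (2).
val left = a.toBlockMatrix(2, 2)
val right = b.toBlockMatrix(2, 2)
left.validate()   // throws if the block layout is inconsistent
right.validate()

val product = left.multiply(right)  // 2 x 2 result: [[4.0, 5.0], [10.0, 11.0]]
println(product.toLocalMatrix())
```

`toLocalMatrix()` collects the whole result to the driver, so it is only suitable for small matrices such as this one.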

Each type has its own strengths and weaknesses, and there are some additional factors to consider when you use sparse or dense elements (Vectors or block Matrices). Multiplying by a local matrix is usually preferable since it doesn't require expensive shuffling.
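
As a rough illustration of the sparse versus dense choice, the sketch below stores the same identity rows both ways and multiplies each RowMatrix by a local matrix. It assumes a local master, and the names `denseRows` and `sparseRows` are made up for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Assumption: reuse the running context (spark-shell) or start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("sparse-vs-dense").setMaster("local[2]"))

// The same 3 x 3 identity rows, stored densely and sparsely.
val denseRows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 0.0),
  Vectors.dense(0.0, 1.0, 0.0),
  Vectors.dense(0.0, 0.0, 1.0)))
val sparseRows = sc.parallelize(Seq(
  Vectors.sparse(3, Array(0), Array(1.0)),
  Vectors.sparse(3, Array(1), Array(1.0)),
  Vectors.sparse(3, Array(2), Array(1.0))))

val localMatrix = Matrices.dense(3, 2, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))

// The local matrix is sent to the executors and each row is multiplied in place.
val fromDense = new RowMatrix(denseRows).multiply(localMatrix).rows.collect()
val fromSparse = new RowMatrix(sparseRows).multiply(localMatrix).rows.collect()

fromDense.foreach(println)
fromSparse.foreach(println)
```

Both variants produce the same row values; the sparse form only stores the non-zero entries, which matters when rows are wide and mostly zero.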

You can find more details about each type in the MLlib Data Types guide.
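
One detail worth spelling out for the local-matrix example: Matrices.dense takes its values in column-major order, so Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0) with 3 rows and 2 columns has the columns (1, 2, 3) and (4, 5, 6). The plain-Scala sketch below (no Spark required) mimics the per-row product to show where the result rows come from; `at` and `multiplyRow` are hypothetical helpers written for this illustration.

```scala
// Plain-Scala sketch; `at` and `multiplyRow` are hypothetical helpers.
val numRows = 3
val numCols = 2
val values = Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)

// Column-major lookup: entry (i, j) lives at offset i + j * numRows.
def at(i: Int, j: Int): Double = values(i + j * numRows)

// One row vector times the local matrix.
def multiplyRow(row: Array[Double]): Array[Double] =
  Array.tabulate(numCols)(j => row.indices.map(i => row(i) * at(i, j)).sum)

val identityRows = Seq(
  Array(1.0, 0.0, 0.0), Array(0.0, 1.0, 0.0), Array(0.0, 0.0, 1.0))

val result = identityRows.map(multiplyRow)
result.foreach(r => println(r.mkString("[", ",", "]")))
// [1.0,4.0]
// [2.0,5.0]
// [3.0,6.0]
```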
