Addition of two RDD[mllib.linalg.Vector]s

Question

I need to add two matrices that are stored in two files.

Both latest1.txt and latest2.txt contain the following:

1 2 3
4 5 6
7 8 9

I am reading those files as follows:

scala> val rows = sc.textFile("latest1.txt").map { line =>
    val values = line.split(' ').map(_.toDouble)
    Vectors.sparse(values.length, values.zipWithIndex.map(e => (e._2, e._1)).filter(_._2 != 0.0))
}

scala> val r1 = rows
r1: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[2] at map at <console>:14

scala> val rows = sc.textFile("latest2.txt").map { line =>
    val values = line.split(' ').map(_.toDouble)
    Vectors.sparse(values.length, values.zipWithIndex.map(e => (e._2, e._1)).filter(_._2 != 0.0))
}

scala> val r2 = rows
r2: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[2] at map at <console>:14

I want to add r1 and r2. Is there any way to add these two RDD[mllib.linalg.Vector]s in Apache Spark?

Solution

This is actually a good question. I work with mllib regularly and did not realize these basic linear algebra operations are not easily accessible.

The point is that the underlying Breeze vectors support all of the linear algebra operations you would expect, including the basic element-wise addition that you specifically mentioned.

However, the Breeze implementation is hidden from the outside world by the access modifier:

private[mllib]

So then, from the outside world/public API perspective, how do we access those primitives?

Some of them are already exposed, e.g. the squared distance between two vectors (Vectors.sqdist):

/**
 * Returns the squared distance between two Vectors.
 * @param v1 first Vector.
 * @param v2 second Vector.
 * @return squared distance between two Vectors.
 */
def sqdist(v1: Vector, v2: Vector): Double = { 
  ...
}
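As a quick illustration of calling one of the exposed primitives, here is a minimal sketch using Vectors.sqdist on two small dense vectors. The values are made up for illustration, and it assumes a Spark version whose Vectors object exposes sqdist.

```scala
import org.apache.spark.mllib.linalg.Vectors

// Two illustrative vectors (not taken from the question's files).
val v1 = Vectors.dense(1.0, 2.0, 3.0)
val v2 = Vectors.dense(4.0, 5.0, 6.0)

// Squared Euclidean distance: (4-1)^2 + (5-2)^2 + (6-3)^2 = 27.0
val d = Vectors.sqdist(v1, v2)
```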

However, the selection of publicly available methods is limited. In fact, it does not include basic operations such as element-wise addition, subtraction, and multiplication.

So here is the best approach I could find:

  • Convert the vectors to Breeze
  • Perform the vector operations in Breeze
  • Convert back from Breeze to an mllib Vector

Here is some sample code:

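import breeze.linalg.DenseVector
import org.apache.spark.mllib.linalg.Vectors

val v1 = Vectors.dense(1.0, 2.0, 3.0)
val v2 = Vectors.dense(4.0, 5.0, 6.0)

// Wrap the underlying arrays in Breeze dense vectors
val bv1 = new DenseVector(v1.toArray)
val bv2 = new DenseVector(v2.toArray)

// Add element-wise in Breeze, then convert back to an mllib Vector
val vectout = Vectors.dense((bv1 + bv2).toArray)
// vectout: org.apache.spark.mllib.linalg.Vector = [5.0,7.0,9.0]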
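The sample above adds two individual vectors. To apply the same idea to two RDDs row by row, one possible sketch is to pair up rows by position and add each pair. The names VectorRddOps, add, and addRows are hypothetical helpers introduced here for illustration, not part of Spark. The sketch assumes Breeze is on the classpath (it normally is in spark-shell, since mllib depends on it) and that both RDDs have the same number of rows. Note that toArray turns sparse vectors into dense ones, so the result does not preserve sparsity.

```scala
import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
// On Spark versions before 1.3, also: import org.apache.spark.SparkContext._

// Hypothetical helper object; not part of the Spark API.
object VectorRddOps {

  // Element-wise sum of two mllib vectors, computed with Breeze dense vectors.
  def add(a: Vector, b: Vector): Vector = {
    require(a.size == b.size, s"size mismatch: ${a.size} vs ${b.size}")
    Vectors.dense((new BDV(a.toArray) + new BDV(b.toArray)).toArray)
  }

  // Adds two RDDs row by row. Rows are matched by their position,
  // so the partitioning of the two RDDs does not have to agree.
  def addRows(r1: RDD[Vector], r2: RDD[Vector]): RDD[Vector] = {
    val left  = r1.zipWithIndex().map(_.swap)  // (rowIndex, vector)
    val right = r2.zipWithIndex().map(_.swap)
    left.join(right)                           // (rowIndex, (v1, v2))
      .sortByKey()                             // restore the original row order
      .values
      .map { case (a, b) => add(a, b) }
  }
}
```

With the r1 and r2 from the question, this would be called as VectorRddOps.addRows(r1, r2). If both RDDs are known to have the same number of partitions and the same number of rows in each partition, r1.zip(r2).map { case (a, b) => VectorRddOps.add(a, b) } avoids the shuffle caused by the join, but zip fails at runtime when that assumption does not hold.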
