Dot product in Spark Scala


Problem description


I have two data frames in Spark Scala, where the second column of each data frame is an array of numbers:

val data22 = Seq((1,List(0.693147,0.6931471)),(2,List(0.69314, 0.0)),(3,List(0.0, 0.693147))).toDF("ID","tf_idf")
data22.show(truncate=false)

+---+---------------------+
|ID |tf_idf               |
+---+---------------------+
|1  |[0.693147, 0.6931471]|
|2  |[0.69314, 0.0]       |
|3  |[0.0, 0.693147]      |
+---+---------------------+



val data12 = Seq((1,List(0.69314,0.6931471))).toDF("ID","tf_idf")
data12.show(truncate=false)

+---+--------------------+
|ID |tf_idf              |
+---+--------------------+
|1  |[0.69314, 0.6931471]|
+---+--------------------+

I need to perform a dot product between the rows of these two data frames. That is, I need to multiply the tf_idf array in data12 element-wise with each row's tf_idf array in data22 and sum the products.

(Ex: The first row of the dot product should be: 0.693147*0.69314 + 0.6931471*0.6931471

Second row: 0.69314*0.69314 + 0.0*0.6931471

Third row: 0.0*0.69314 + 0.693147*0.6931471)
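
For clarity, the per-row operation is just the standard dot product of two equal-length sequences. A minimal plain-Scala sketch of what I mean (the helper name dot is only for illustration):

def dot(a: Seq[Double], b: Seq[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum   // pairwise products, then a sum

dot(Seq(0.693147, 0.6931471), Seq(0.69314, 0.6931471))   // 0.96090081381841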

Basically I want something like a matrix multiplication, data22 * transpose(data12). I would be grateful if someone could suggest a method to do this in Spark Scala.

Thank you

Solution

The solution is shown below:

scala> import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.udf

scala> import scala.collection.mutable.WrappedArray
import scala.collection.mutable.WrappedArray

scala> val data22 = Seq((1,List(0.693147,0.6931471)),(2,List(0.69314, 0.0)),(3,List(0.0, 0.693147))).toDF("ID","tf_idf")
data22: org.apache.spark.sql.DataFrame = [ID: int, tf_idf: array<double>]

scala> val data12 = Seq((1,List(0.69314,0.6931471))).toDF("ID","tf_idf")
data12: org.apache.spark.sql.DataFrame = [ID: int, tf_idf: array<double>]

scala> val arrayDot = data12.take(1).map(row => (row.getAs[Int](0), row.getAs[WrappedArray[Double]](1).toSeq))
arrayDot: Array[(Int, Seq[Double])] = Array((1,WrappedArray(0.69314, 0.6931471)))

scala> val dotColumn = arrayDot(0)._2
dotColumn: Seq[Double] = WrappedArray(0.69314, 0.6931471)

scala> val dotUdf = udf((y: Seq[Double]) => y zip dotColumn map(z => z._1*z._2) reduce(_ + _))
dotUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,DoubleType,Some(List(ArrayType(DoubleType,false))))

scala> data22.withColumn("dotProduct", dotUdf('tf_idf)).show
+---+--------------------+-------------------+
| ID|              tf_idf|         dotProduct|
+---+--------------------+-------------------+
|  1|[0.693147, 0.6931...|   0.96090081381841|
|  2|      [0.69314, 0.0]|0.48044305959999994|
|  3|     [0.0, 0.693147]|    0.4804528329237|
+---+--------------------+-------------------+

Note that it multiplies the single tf_idf array from data12 (collected to the driver with take(1)) with each row of tf_idf in data22; the UDF captures dotColumn through its closure.
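
If data12 could ever hold several rows, the take(1)/closure approach no longer fits. One way to generalize, sketched here assuming a Cartesian join is acceptable because data12 stays small (the names pairDot, ID2 and tf_idf2 are only illustrative), is a crossJoin plus a two-argument UDF:

// Same zip-multiply-sum as dotUdf, but both arrays come from the joined row
// instead of one being captured in a closure.
val pairDot = udf((a: Seq[Double], b: Seq[Double]) =>
  a.zip(b).map { case (x, y) => x * y }.sum)

// Pair every row of data22 with every row of data12; the renames avoid
// duplicate column names after the join.
data22
  .crossJoin(data12.withColumnRenamed("ID", "ID2").withColumnRenamed("tf_idf", "tf_idf2"))
  .withColumn("dotProduct", pairDot('tf_idf, 'tf_idf2))
  .show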

Let me know if it helps!!
