Dot product in Spark Scala


Problem description


I have two data frames in Spark Scala, where the second column of each data frame is an array of numbers:

val data22 = Seq((1,List(0.693147,0.6931471)),(2,List(0.69314, 0.0)),(3,List(0.0, 0.693147))).toDF("ID","tf_idf")
data22.show(truncate=false)

+---+---------------------+
|ID |tf_idf               |
+---+---------------------+
|1  |[0.693147, 0.6931471]|
|2  |[0.69314, 0.0]       |
|3  |[0.0, 0.693147]      |
+---+---------------------+



val data12 = Seq((1,List(0.69314,0.6931471))).toDF("ID","tf_idf")
data12.show(truncate=false)

+---+--------------------+
|ID |tf_idf              |
+---+--------------------+
|1  |[0.69314, 0.6931471]|
+---+--------------------+

I need to perform a dot product between the rows of these two data frames. That is, I need to multiply the tf_idf array in data12 element-wise with each row's tf_idf array in data22 and sum the products.

(Ex: The first row of the dot product should be: 0.693147*0.69314 + 0.6931471*0.6931471

Second row: 0.69314*0.69314 + 0.0*0.6931471

Third row: 0.0*0.69314 + 0.693147*0.6931471)
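
For clarity, the per-row operation is just the standard dot product of two equal-length sequences. A minimal plain-Scala sketch of what I mean (the helper name dot is only for illustration):

def dot(a: Seq[Double], b: Seq[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum   // pairwise products, then a sum

dot(Seq(0.693147, 0.6931471), Seq(0.69314, 0.6931471))   // 0.96090081381841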

Basically I want something like a matrix multiplication, data22 * transpose(data12). I would be grateful if someone could suggest a method to do this in Spark Scala.

Thank you

Solution

The solution is shown below:

scala> import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.udf

scala> import scala.collection.mutable.WrappedArray
import scala.collection.mutable.WrappedArray

scala> val data22 = Seq((1,List(0.693147,0.6931471)),(2,List(0.69314, 0.0)),(3,List(0.0, 0.693147))).toDF("ID","tf_idf")
data22: org.apache.spark.sql.DataFrame = [ID: int, tf_idf: array<double>]

scala> val data12 = Seq((1,List(0.69314,0.6931471))).toDF("ID","tf_idf")
data12: org.apache.spark.sql.DataFrame = [ID: int, tf_idf: array<double>]

scala> val arrayDot = data12.take(1).map(row => (row.getAs[Int](0), row.getAs[WrappedArray[Double]](1).toSeq))
arrayDot: Array[(Int, Seq[Double])] = Array((1,WrappedArray(0.69314, 0.6931471)))

scala> val dotColumn = arrayDot(0)._2
dotColumn: Seq[Double] = WrappedArray(0.69314, 0.6931471)

scala> val dotUdf = udf((y: Seq[Double]) => y zip dotColumn map(z => z._1*z._2) reduce(_ + _))
dotUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,DoubleType,Some(List(ArrayType(DoubleType,false))))

scala> data22.withColumn("dotProduct", dotUdf('tf_idf)).show
+---+--------------------+-------------------+
| ID|              tf_idf|         dotProduct|
+---+--------------------+-------------------+
|  1|[0.693147, 0.6931...|   0.96090081381841|
|  2|      [0.69314, 0.0]|0.48044305959999994|
|  3|     [0.0, 0.693147]|    0.4804528329237|
+---+--------------------+-------------------+

Note that it multiplies the single tf_idf array from data12 (collected to the driver with take(1)) with each row of tf_idf in data22; the UDF captures dotColumn through its closure.
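
If data12 could ever hold several rows, the take(1)/closure approach no longer fits. One way to generalize, sketched here assuming a Cartesian join is acceptable because data12 stays small (the names pairDot, ID2 and tf_idf2 are only illustrative), is a crossJoin plus a two-argument UDF:

// Same zip-multiply-sum as dotUdf, but both arrays come from the joined row
// instead of one being captured in a closure.
val pairDot = udf((a: Seq[Double], b: Seq[Double]) =>
  a.zip(b).map { case (x, y) => x * y }.sum)

// Pair every row of data22 with every row of data12; the renames avoid
// duplicate column names after the join.
data22
  .crossJoin(data12.withColumnRenamed("ID", "ID2").withColumnRenamed("tf_idf", "tf_idf2"))
  .withColumn("dotProduct", pairDot('tf_idf, 'tf_idf2))
  .show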

Let me know if it helps!!
