Dot product in Spark Scala
Problem description
I have two data frames in Spark Scala where the second column of each data frame is an array of numbers
val data22= Seq((1,List(0.693147,0.6931471)),(2,List(0.69314, 0.0)),(3,List(0.0, 0.693147))).toDF("ID","tf_idf")
data22.show(truncate=false)
+---+---------------------+
|ID |tf_idf               |
+---+---------------------+
|1  |[0.693147, 0.6931471]|
|2  |[0.69314, 0.0]       |
|3  |[0.0, 0.693147]      |
+---+---------------------+
val data12= Seq((1,List(0.69314,0.6931471))).toDF("ID","tf_idf")
data12.show(truncate=false)
+---+--------------------+
|ID |tf_idf              |
+---+--------------------+
|1  |[0.69314, 0.6931471]|
+---+--------------------+
I need to perform the dot product between rows of these two data frames. That is, I need to multiply the tf_idf array in data12 with each row of tf_idf in data22.
(Ex: The first row of the dot product should be: 0.693147*0.69314 + 0.6931471*0.6931471
Second row: 0.69314*0.69314 + 0.0*0.6931471
Third row: 0.0*0.69314 + 0.693147*0.6931471)
Basically I want something like the matrix multiplication data22 * transpose(data12).
I would be grateful if someone could suggest a way to do this in Spark Scala.
Thank you
The solution is shown below:
scala> import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.udf

scala> import scala.collection.mutable.WrappedArray
import scala.collection.mutable.WrappedArray

scala> val data22= Seq((1,List(0.693147,0.6931471)),(2,List(0.69314, 0.0)),(3,List(0.0, 0.693147))).toDF("ID","tf_idf")
data22: org.apache.spark.sql.DataFrame = [ID: int, tf_idf: array<double>]
scala> val data12= Seq((1,List(0.69314,0.6931471))).toDF("ID","tf_idf")
data12: org.apache.spark.sql.DataFrame = [ID: int, tf_idf: array<double>]
scala> val arrayDot = data12.take(1).map(row => (row.getAs[Int](0), row.getAs[WrappedArray[Double]](1).toSeq))
arrayDot: Array[(Int, Seq[Double])] = Array((1,WrappedArray(0.69314, 0.6931471)))
scala> val dotColumn = arrayDot(0)._2
dotColumn: Seq[Double] = WrappedArray(0.69314, 0.6931471)
scala> val dotUdf = udf((y: Seq[Double]) => y zip dotColumn map(z => z._1*z._2) reduce(_ + _))
dotUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,DoubleType,Some(List(ArrayType(DoubleType,false))))
scala> data22.withColumn("dotProduct", dotUdf('tf_idf)).show
+---+--------------------+-------------------+
| ID| tf_idf| dotProduct|
+---+--------------------+-------------------+
| 1|[0.693147, 0.6931...| 0.96090081381841|
| 2| [0.69314, 0.0]|0.48044305959999994|
| 3| [0.0, 0.693147]| 0.4804528329237|
+---+--------------------+-------------------+
Note that it multiplies the tf_idf array in data12 with each row of tf_idf in data22.
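If data12 grew to more than one row, the same idea generalizes with a crossJoin and a two-argument UDF. A minimal sketch, assuming the data22 and data12 DataFrames from above and an active SparkSession with its implicits imported:

```scala
import org.apache.spark.sql.functions.udf

// Pairwise dot product of two array<double> columns.
val pairDot = udf((x: Seq[Double], y: Seq[Double]) =>
  x.zip(y).map { case (a, b) => a * b }.sum)

// crossJoin (Spark 2.1+) pairs every row of data22 with every row of data12,
// so each pair gets its own dot product.
val result = data22.alias("a")
  .crossJoin(data12.alias("b"))
  .select($"a.ID".as("ID22"), $"b.ID".as("ID12"),
          pairDot($"a.tf_idf", $"b.tf_idf").as("dotProduct"))
```

For a single-row data12 this yields the same three dot products as the UDF above, one per row of data22.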
Let me know if it helps!!
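On Spark 2.4 or later, the same computation can also be sketched without a Scala UDF by pushing the zip/multiply/sum into SQL higher-order functions (zip_with and aggregate); dotColumn here is the Seq[Double] extracted from data12 in the steps above:

```scala
import org.apache.spark.sql.functions.{array, expr, lit}

// Turn the driver-side Seq[Double] into a literal array column.
val dotVec = array(dotColumn.map(lit): _*)

// zip_with multiplies the arrays element-wise; aggregate sums the products.
val withDot = data22
  .withColumn("vec", dotVec)
  .withColumn("dotProduct",
    expr("aggregate(zip_with(tf_idf, vec, (x, y) -> x * y), 0D, (acc, v) -> acc + v)"))
  .drop("vec")
```

Keeping the logic in Catalyst expressions avoids the serialization overhead of a UDF.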