Converting a vector column in a dataframe back into an array column
Problem description
I have a dataframe with two columns, one of which (called dist) is a dense vector. How can I convert it back into an array column of integers?
+---+-----+
| id| dist|
+---+-----+
|1.0|[2.0]|
|2.0|[4.0]|
|3.0|[6.0]|
|4.0|[8.0]|
+---+-----+
I tried several variants of the following udf, but each one returns a type mismatch error:
val toInt4 = udf[Int, Vector]({ (a) => (a)})
val result = df.withColumn("dist", toInt4(df("dist"))).select("dist")
Recommended answer
I think it's easiest to go to the RDD API and then back.
import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.sql.DataFrame
import org.apache.spark.rdd.RDD
// Brings toDF into scope (already imported in the Spark shell).
import sqlContext.implicits._

// The original data.
val input: DataFrame =
  sc.parallelize(1 to 4)
    .map(i => i.toDouble -> new DenseVector(Array(i.toDouble * 2)))
    .toDF("id", "dist")

// Turn it into an RDD for manipulation.
val inputRDD: RDD[(Double, DenseVector)] =
  input.rdd.map(row => row.getAs[Double]("id") -> row.getAs[DenseVector]("dist"))

// Change the DenseVector into an integer array (toInt truncates any fractional part).
val outputRDD: RDD[(Double, Array[Int])] =
  inputRDD.mapValues(_.toArray.map(_.toInt))

// Go back to a DataFrame.
val output = outputRDD.toDF("id", "dist")
output.show()
You get:
+---+----+
| id|dist|
+---+----+
|1.0| [2]|
|2.0| [4]|
|3.0| [6]|
|4.0| [8]|
+---+----+
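For comparison, the udf from the question can probably be made to work without leaving the DataFrame API. As written, it declares Int as its return type but hands back the Vector unchanged, which is the type mismatch. The sketch below is not part of the original answer; it assumes the Spark shell (so `sc` and `sqlContext` already exist) and the same `org.apache.spark.mllib.linalg` vector type used above. The names `toIntArray` and `result` are made up for illustration.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf
import sqlContext.implicits._

// Rebuild the question's data (assumes `sc` and `sqlContext` from the Spark shell).
val df = sc.parallelize(1 to 4)
  .map(i => (i.toDouble, Vectors.dense(i.toDouble * 2)))
  .toDF("id", "dist")

// The function's return type is inferred as Array[Int], so the udf
// produces an array<int> column instead of claiming to return a single Int.
// Note that toInt truncates any fractional part.
val toIntArray = udf((v: Vector) => v.toArray.map(_.toInt))

// select with an alias replaces the vector column with the converted one.
val result = df.select(df("id"), toIntArray(df("dist")).as("dist"))
result.show()
```

Whether this or the RDD round trip is preferable is mostly a matter of taste; the udf version keeps any other columns in place without having to list them in a tuple.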