Spark Dataframe of WrappedArray to Dataframe[Vector]
Question
I have a Spark DataFrame df with the following schema:
root
 |-- features: array (nullable = true)
 |    |-- element: double (containsNull = false)
I would like to create a new DataFrame where each row is a Vector of Doubles, expecting to get the following schema:
root
 |-- features: vector (nullable = true)
So far I have the following piece of code (influenced by this post: Converting Spark Dataframe(with WrappedArray) to RDD[labelPoint] in scala), but I fear something is wrong with it because it takes a very long time to compute even a reasonable number of rows. Also, if there are too many rows, the application crashes with a heap space exception.
import scala.collection.mutable
import org.apache.spark.ml.linalg.{Vector, Vectors}
import spark.implicits._ // assumes a SparkSession named spark

val clustSet = df.rdd.map(r => {
  // Pull the array column out of each Row and build a dense ml Vector
  val arr = r.getAs[mutable.WrappedArray[Double]]("features")
  val features: Vector = Vectors.dense(arr.toArray)
  features
}).map(Tuple1(_)).toDF()
I suspect that the arr.toArray instruction is not good Spark practice in this case. Any clarification would be very helpful.
Thanks!
Answer
It's because .rdd has to deserialize objects from the internal in-memory format, which is very time consuming.
It's fine to use .toArray here: you are operating at the row level, not collecting everything to the driver node.
You can do this very easily with a UDF:
import org.apache.spark.ml.linalg._
import org.apache.spark.sql.functions.udf
import spark.implicits._ // for the 'features column syntax

// Wrap the Array[Double] -> Vector conversion in a UDF
val convertUDF = udf((array: Seq[Double]) => {
  Vectors.dense(array.toArray)
})

val withVector = dataset
  .withColumn("features", convertUDF('features))
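As a quick sanity check (a minimal sketch, assuming a SparkSession named spark and hypothetical sample data shaped like the question's array<double> column), printing the schema of the result should now show the vector type:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.udf
import spark.implicits._

// Hypothetical sample data matching the question's schema: array<double>
val dataset = Seq(Seq(1.0, 2.0, 3.0), Seq(4.0, 5.0, 6.0)).toDF("features")

val convertUDF = udf((array: Seq[Double]) => Vectors.dense(array.toArray))
val withVector = dataset.withColumn("features", convertUDF('features))

withVector.printSchema()
// root
//  |-- features: vector (nullable = true)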
The code is from this answer: Convert ArrayType(FloatType,false) to VectorUDT
However, the author of that question didn't ask about the differences.
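As a side note beyond the original answer: if you are on Spark 3.1 or later, the same conversion ships as a built-in helper in org.apache.spark.ml.functions, so no hand-rolled UDF is needed:

import org.apache.spark.ml.functions.array_to_vector
import org.apache.spark.sql.functions.col

// array_to_vector (Spark 3.1+) converts an array column to the ml vector type
val withVector = dataset.withColumn("features", array_to_vector(col("features")))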