Spark Dataframe of WrappedArray to Dataframe[Vector]
Question
I have a Spark DataFrame df with the following schema:
root
 |-- features: array (nullable = true)
 |    |-- element: double (containsNull = false)
I would like to create a new DataFrame where each row is a Vector of Doubles, expecting to get the following schema:
root
 |-- features: vector (nullable = true)
So far I have the following piece of code (influenced by this post: Converting Spark Dataframe(with WrappedArray) to RDD[labelPoint] in scala), but I fear something is wrong with it because it takes a very long time to compute even a reasonable number of rows. Also, if there are too many rows, the application crashes with a heap space exception.
import scala.collection.mutable
import org.apache.spark.ml.linalg.{Vector, Vectors}
import spark.implicits._ // assumes a SparkSession named spark

val clustSet = df.rdd.map(r => {
  // Pull the array column out of each Row and build a dense ml Vector
  val arr = r.getAs[mutable.WrappedArray[Double]]("features")
  val features: Vector = Vectors.dense(arr.toArray)
  features
}).map(Tuple1(_)).toDF()
I suspect that the arr.toArray instruction is not good Spark practice in this case. Any clarification would be very helpful.
Thanks!
Answer
It's because .rdd has to deserialize objects from the internal in-memory format, which is very time consuming.
It's fine to use .toArray here: you are operating at the row level, not collecting everything to the driver node.
You can do this very easily with a UDF:
import org.apache.spark.ml.linalg._
import org.apache.spark.sql.functions.udf
import spark.implicits._ // for the 'features column syntax

// Wrap the Array[Double] -> Vector conversion in a UDF
val convertUDF = udf((array: Seq[Double]) => {
  Vectors.dense(array.toArray)
})

val withVector = dataset
  .withColumn("features", convertUDF('features))
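As a quick sanity check (a minimal sketch, assuming a SparkSession named spark and hypothetical sample data shaped like the question's array<double> column), printing the schema of the result should now show the vector type:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.udf
import spark.implicits._

// Hypothetical sample data matching the question's schema: array<double>
val dataset = Seq(Seq(1.0, 2.0, 3.0), Seq(4.0, 5.0, 6.0)).toDF("features")

val convertUDF = udf((array: Seq[Double]) => Vectors.dense(array.toArray))
val withVector = dataset.withColumn("features", convertUDF('features))

withVector.printSchema()
// root
//  |-- features: vector (nullable = true)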
The code is from this answer: Convert ArrayType(FloatType,false) to VectorUDT
However, the author of that question didn't ask about the differences.
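As a side note beyond the original answer: if you are on Spark 3.1 or later, the same conversion ships as a built-in helper in org.apache.spark.ml.functions, so no hand-rolled UDF is needed:

import org.apache.spark.ml.functions.array_to_vector
import org.apache.spark.sql.functions.col

// array_to_vector (Spark 3.1+) converts an array column to the ml vector type
val withVector = dataset.withColumn("features", array_to_vector(col("features")))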