Spark Scala: How to convert DataFrame[Vector] to DataFrame[f1: Double, ..., fn: Double]


Question

I just used StandardScaler to normalize my features for an ML application. After selecting the scaled features, I want to convert this back to a DataFrame of Doubles, though the length of my vectors is arbitrary. I know how to do it for a fixed set of 3 features by using

myDF.map{case Row(v: Vector) => (v(0), v(1), v(2))}.toDF("f1", "f2", "f3")

but not for an arbitrary number of features. Is there an easy way to do this?

Example:

import org.apache.spark.ml.linalg.Vectors // in Spark 1.x use org.apache.spark.mllib.linalg.Vectors

val testDF = sc.parallelize(List(Vectors.dense(5D, 6D, 7D), Vectors.dense(8D, 9D, 10D), Vectors.dense(11D, 12D, 13D))).map(Tuple1(_)).toDF("scaledFeatures")
val myColumnNames = List("f1", "f2", "f3")
// Goal: finalDF = DataFrame[f1: Double, f2: Double, f3: Double]

Edit

I found out how to unpack to column names when creating the DataFrame, but I'm still having trouble converting a vector to the sequence needed to create the DataFrame:

val finalDF = testDF.map { case Row(v: Vector) => v.toArray.toSeq /* <= this errors */ }.toDF(List("f1", "f2", "f3"): _*)

Answer

Spark >= 3.0.0

Since Spark 3.0 you can use vector_to_array:

import org.apache.spark.ml.functions.vector_to_array

// exprs is the per-element column list defined in the Spark < 3.0.0 section below
testDF.select(vector_to_array($"scaledFeatures").alias("_tmp")).select(exprs: _*)
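
Put together, a minimal self-contained sketch (reusing the testDF from the question; the vector length is read from the first row, which assumes all vectors have the same size):

import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.col

// Vector length, taken from the first row
val n = testDF.first.getAs[Vector](0).size

// One column per element: _tmp[0] as f0, _tmp[1] as f1, ...
val exprs = (0 until n).map(i => col("_tmp").getItem(i).alias(s"f$i"))

testDF
  .select(vector_to_array(col("scaledFeatures")).alias("_tmp"))
  .select(exprs: _*)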

Spark < 3.0.0

One possible approach is something similar to this:

import org.apache.spark.sql.functions.udf

// In Spark 1.x you'll have to replace the ML Vector with the MLlib one
// import org.apache.spark.mllib.linalg.Vector
// In 2.x the below is usually the right choice
import org.apache.spark.ml.linalg.Vector

// Get size of the vector
val n = testDF.first.getAs[Vector](0).size

// Simple helper UDF to convert a Vector to an array<double>
// asNondeterministic is available in Spark 2.3 or later
// It can be removed, but at the cost of decreased performance
val vecToSeq = udf((v: Vector) => v.toArray).asNondeterministic

// Prepare a list of columns to create
val exprs = (0 until n).map(i => $"_tmp".getItem(i).alias(s"f$i"))

testDF.select(vecToSeq($"scaledFeatures").alias("_tmp")).select(exprs:_*)
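
Applied to the testDF from the question, the result is a DataFrame[f0: Double, f1: Double, f2: Double]; its show() output should look roughly like this:

+----+----+----+
|  f0|  f1|  f2|
+----+----+----+
| 5.0| 6.0| 7.0|
| 8.0| 9.0|10.0|
|11.0|12.0|13.0|
+----+----+----+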

If you know a list of columns upfront you can simplify this a little:

val cols: Seq[String] = ???
val exprs = cols.zipWithIndex.map{ case (c, i) => $"_tmp".getItem(i).alias(c) }
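
For example, with the column names from the question (a sketch reusing the vecToSeq UDF and testDF defined above):

val cols = Seq("f1", "f2", "f3")
val exprs = cols.zipWithIndex.map { case (c, i) => $"_tmp".getItem(i).alias(c) }

testDF.select(vecToSeq($"scaledFeatures").alias("_tmp")).select(exprs: _*)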

For a Python equivalent, see How to split Vector into columns - using PySpark.
