Automatically and Elegantly flatten DataFrame in Spark SQL
Problem description
All,
Is there an elegant and accepted way to flatten a Spark SQL table (Parquet) with columns that are of nested StructType?
For example, if my schema is:
foo
 |_bar
 |_baz
x
y
z
How do I select it into a flattened tabular form without resorting to manually running
df.select("foo.bar","foo.baz","x","y","z")
In other words, how do I obtain the result of the above code programmatically given just a StructType and a DataFrame?
Recommended answer
The short answer is, there's no "accepted" way to do this, but you can do it very elegantly with a recursive function that generates your select(...) statement by walking through the DataFrame.schema.
The recursive function should return an Array[Column]. Every time the function hits a StructType, it would call itself and append the returned Array[Column] to its own Array[Column].
Something like this:
import org.apache.spark.sql.Column
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.functions.col

def flattenSchema(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap(f => {
    // Build the dotted path to this field, e.g. "foo.bar"
    val colName = if (prefix == null) f.name else (prefix + "." + f.name)
    f.dataType match {
      // Nested struct: recurse, carrying the path built so far as the prefix
      case st: StructType => flattenSchema(st, colName)
      // Leaf field: emit a single Column for it
      case _ => Array(col(colName))
    }
  })
}
You would then use it like this:
df.select(flattenSchema(df.schema):_*)
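To see the recursion at work without a running Spark session, here is a minimal, Spark-free sketch of the same walk. `DType`, `Struct`, and `flatten` are toy stand-ins invented for illustration (they are not Spark classes); the real function operates on Spark's StructType and returns Column objects instead of strings:

```scala
// Toy stand-ins for Spark's schema types (hypothetical, for illustration only).
sealed trait DType
case object Leaf extends DType
case class Struct(fields: List[(String, DType)]) extends DType

// Mirrors flattenSchema: recurse into nested structs, prefixing with the parent name.
def flatten(schema: Struct, prefix: String = null): List[String] =
  schema.fields.flatMap { case (name, dtype) =>
    val colName = if (prefix == null) name else prefix + "." + name
    dtype match {
      case st: Struct => flatten(st, colName)   // nested struct: recurse
      case Leaf       => List(colName)          // leaf: emit the dotted path
    }
  }

// The schema from the question: foo.{bar, baz}, plus top-level x, y, z.
val sample = Struct(List(
  "foo" -> Struct(List("bar" -> Leaf, "baz" -> Leaf)),
  "x" -> Leaf,
  "y" -> Leaf,
  "z" -> Leaf
))

println(flatten(sample))  // List(foo.bar, foo.baz, x, y, z)
```

The recursion bottoms out at leaf fields and concatenates results via flatMap, which is exactly why the real version can be splatted straight into df.select(...:_*).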