Automatically and Elegantly flatten DataFrame in Spark SQL


Question

All,

Is there an elegant and accepted way to flatten a Spark SQL table (Parquet) with columns that are of nested StructType?

For example, if my schema is:

foo
 |_bar
 |_baz
x
y
z

How do I select it into a flattened tabular form without resorting to manually running

df.select("foo.bar","foo.baz","x","y","z")

In other words, how do I obtain the result of the above code programmatically, given just a StructType and a DataFrame?

Answer

The short answer is, there's no "accepted" way to do this, but you can do it very elegantly with a recursive function that generates your select(...) statement by walking through the DataFrame.schema.

The recursive function should return an Array[Column]. Every time the function hits a StructType, it would call itself and append the returned Array[Column] to its own Array[Column].

Something like:

import org.apache.spark.sql.Column
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.functions.col

def flattenSchema(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap(f => {
    // Build the fully qualified column name, e.g. "foo.bar"
    val colName = if (prefix == null) f.name else (prefix + "." + f.name)

    f.dataType match {
      // Recurse into nested structs, carrying the prefix down
      case st: StructType => flattenSchema(st, colName)
      // Leaf field: emit it as a single Column
      case _ => Array(col(colName))
    }
  })
}

You would then use it like this:

df.select(flattenSchema(df.schema):_*)
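As a quick illustration (a sketch added here, not part of the original answer), you can check the select list the function generates against the manual version from the question by running it on a hand-built copy of the example schema. The field types (StringType, IntegerType) are assumptions, since the question does not specify them:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Same function as above, repeated so this snippet is self-contained
def flattenSchema(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap(f => {
    val colName = if (prefix == null) f.name else (prefix + "." + f.name)
    f.dataType match {
      case st: StructType => flattenSchema(st, colName)
      case _ => Array(col(colName))
    }
  })
}

// A hand-built version of the question's example schema (types are assumed)
val schema = StructType(Seq(
  StructField("foo", StructType(Seq(
    StructField("bar", StringType),
    StructField("baz", StringType)
  ))),
  StructField("x", IntegerType),
  StructField("y", IntegerType),
  StructField("z", IntegerType)
))

// Inspect the generated select list without needing a SparkSession
val cols = flattenSchema(schema).map(_.toString)
```

This should produce the same list as the manual df.select("foo.bar","foo.baz","x","y","z"). Note that the flattened column names keep their dots (e.g. "foo.bar"), so you may want to alias them afterwards if downstream code cannot handle dotted names.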
