Flatten Nested Spark Dataframe
Question
Is there a way to flatten an arbitrarily nested Spark Dataframe? Most of the work I'm seeing is written for specific schema, and I'd like to be able to generically flatten a Dataframe with different nested types (e.g. StructType, ArrayType, MapType, etc).
Say I have a schema like:
StructType(List(StructField(field1,...), StructField(field2,...), ArrayType(StructType(List(StructField(nested_field1,...), StructField(nested_field2,...)),nested_array,...)))
Looking to adapt this into a flat table with a structure like:
field1
field2
nested_array.nested_field1
nested_array.nested_field2
FYI, looking for suggestions for Pyspark, but other flavors of Spark are also appreciated.
Answer
This issue might be a bit old, but for anyone out there still looking for a solution you can flatten complex data types inline using select *:
first let's create the nested dataframe:
from pyspark.sql import HiveContext
hc = HiveContext(sc)  # sc: an existing SparkContext (on Spark 2.x+ a SparkSession works the same way)
nested_df = hc.read.json(sc.parallelize(["""
{
  "field1": 1,
  "field2": 2,
  "nested_array": {
    "nested_field1": 3,
    "nested_field2": 4
  }
}
"""]))
now let's flatten it:
flat_df = nested_df.select("field1", "field2", "nested_array.*")
You'll find useful examples here: https://docs.databricks.com/delta/data-transformation/complex-types.html
If you have too many nested struct columns to list by hand, you can select them programmatically:
flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']    # keep non-struct columns as-is
nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']  # struct columns to unpack with ".*"
flat_df = nested_df.select(*flat_cols, *[c + ".*" for c in nested_cols])