展平嵌套Spark数据框 [英] Flatten Nested Spark Dataframe
问题描述
是否有一种方法可以展平任意嵌套的Spark数据框?我看到的大部分工作都是针对特定的架构编写的,我希望能够通用地将具有不同嵌套类型(例如StructType,ArrayType,MapType等)的数据框展平.
说我有一个类似的架构:
StructType(List(StructField(field1,...), StructField(field2,...), ArrayType(StructType(List(StructField(nested_field1,...), StructField(nested_field2,...)),nested_array,...)))
希望将其改编成具有以下结构的平板:
field1
field2
nested_array.nested_field1
nested_array.nested_field2
FYI,正在寻找有关Pyspark的建议,但也欢迎使用其他口味的Spark.
这个问题可能有点老了,但是对于仍然在寻找解决方案的任何人,您都可以使用select *:
首先让我们创建嵌套的数据框:
from pyspark.sql import HiveContext
hc = HiveContext(sc)
nested_df = hc.read.json(sc.parallelize(["""
{
"field1": 1,
"field2": 2,
"nested_array":{
"nested_field1": 3,
"nested_field2": 4
}
}
"""]))
现在将其压平:
flat_df = nested_df.select("field1", "field2", "nested_array.*")
您会在这里找到有用的示例: https://docs.databricks.com/delta/data-transformation/complex -types.html
如果嵌套数组太多,则可以使用:
flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
flat_df = nested_df.select(*flat_cols, *[c + ".*" for c in nested_cols])
Is there a way to flatten an arbitrarily nested Spark Dataframe? Most of the work I'm seeing is written for specific schema, and I'd like to be able to generically flatten a Dataframe with different nested types (e.g. StructType, ArrayType, MapType, etc).
Say I have a schema like:
StructType(List(StructField(field1,...), StructField(field2,...), ArrayType(StructType(List(StructField(nested_field1,...), StructField(nested_field2,...)),nested_array,...)))
Looking to adapt this into a flat table with a structure like:
field1
field2
nested_array.nested_field1
nested_array.nested_field2
FYI, looking for suggestions for Pyspark, but other flavors of Spark are also appreciated.
This issue might be a bit old, but for anyone out there still looking for a solution you can flatten complex data types inline using select *:
first let's create the nested dataframe:
from pyspark.sql import HiveContext
hc = HiveContext(sc)
nested_df = hc.read.json(sc.parallelize(["""
{
"field1": 1,
"field2": 2,
"nested_array":{
"nested_field1": 3,
"nested_field2": 4
}
}
"""]))
now to flatten it:
flat_df = nested_df.select("field1", "field2", "nested_array.*")
You'll find useful examples here: https://docs.databricks.com/delta/data-transformation/complex-types.html
If you have too many nested arrays, you can use:
flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
flat_df = nested_df.select(*flat_cols, *[c + ".*" for c in nested_cols])
这篇关于展平嵌套Spark数据框的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!