Elegant Json flatten in Spark

Views: 34 | Published: 2021/11/14 21:58:53 | Tags: json, scala, apache-spark, apache-spark-sql

Question

I have the following dataframe in spark:

val test = sqlContext.read.json(path = "/path/to/jsonfiles/*")  
test.printSchema
root
 |-- properties: struct (nullable = true)
 |    |-- prop_1: string (nullable = true)
 |    |-- prop_2: string (nullable = true)
 |    |-- prop_3: boolean (nullable = true)
 |    |-- prop_4: long (nullable = true)
...

What I would like to do is flatten this dataframe so that the prop_1 ... prop_n exist at the top level. I.e.

test.printSchema
root
|-- prop_1: string (nullable = true)
|-- prop_2: string (nullable = true)
|-- prop_3: boolean (nullable = true)
|-- prop_4: long (nullable = true)
...

There are several solutions to similar problems. The best I could find is posted here. However, that solution only works if properties is of type Array. In my case, properties is of type StructType.

Another approach is:

test.registerTempTable("test")
val test2 = sqlContext.sql("""SELECT properties.prop_1, ... FROM test""")

But in this case I have to explicitly specify each column, which is inelegant.

What is the best way to solve this problem?

Recommended answer

If you're not looking for a recursive solution, then in Spark 1.6+ the dot syntax with a star should work just fine:

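// Enables the $"..." column syntax (already in scope in spark-shell)
import sqlContext.implicits._

val df = sqlContext.read.json(sc.parallelize(Seq(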
  """{"properties": {
       "prop1": "foo", "prop2": "bar", "prop3": true, "prop4": 1}}"""
)))

df.select($"properties.*").printSchema
// root
//  |-- prop1: string (nullable = true)
//  |-- prop2: string (nullable = true)
//  |-- prop3: boolean (nullable = true)
//  |-- prop4: long (nullable = true)

Unfortunately, this doesn't work in Spark 1.5 and earlier.

In that case you can extract the required information directly from the schema. You'll find one example in "Dropping a nested column from Spark DataFrame", which should be easy to adapt to this scenario, and another one (recursive schema flattening in Python) in "Pyspark: Map a SchemaRDD into a SchemaRDD".
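
As a rough illustration of the schema-driven approach (not the code from either linked answer), the sketch below builds the select list from a StructType. The helper names structFieldColumns, leafPaths and flattenColumns are hypothetical and not part of Spark. The recursive variant joins nested names with "_" to keep the output names unique, which is only one possible naming choice. Backtick quoting of field names is assumed to be accepted by the Spark version in use.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Hypothetical helper: one column per direct field of the struct column
// `structName`, aliased to the bare field name (one level only).
def structFieldColumns(schema: StructType, structName: String): Seq[Column] =
  schema(structName).dataType match {
    case st: StructType =>
      st.fieldNames.toSeq.map(name => col(s"`$structName`.`$name`").alias(name))
    case other =>
      throw new IllegalArgumentException(s"$structName is not a struct: $other")
  }

// Hypothetical helper: the path to every leaf (non-struct) field, found by
// walking nested StructTypes recursively. Arrays and maps are treated as leaves.
def leafPaths(schema: StructType, prefix: Seq[String] = Nil): Seq[Seq[String]] =
  schema.fields.toSeq.flatMap { field =>
    val path = prefix :+ field.name
    field.dataType match {
      case nested: StructType => leafPaths(nested, path)
      case _                  => Seq(path)
    }
  }

// Recursive variant: one column per leaf, named by joining the path with "_"
// (an assumption made here to avoid name collisions between nested structs).
def flattenColumns(schema: StructType): Seq[Column] =
  leafPaths(schema).map { path =>
    col(path.map(n => s"`$n`").mkString(".")).alias(path.mkString("_"))
  }
```

With the df from the answer above, df.select(structFieldColumns(df.schema, "properties"): _*) should put prop1 ... prop4 at the top level, and df.select(flattenColumns(df.schema): _*) is the recursive counterpart. Neither helper relies on the star expansion.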
