Parsing Nested JSON into a Spark DataFrame Using PySpark


Problem Description

I would really love some help with parsing nested JSON data using PySpark-SQL. The data has the following schema (blank spaces are edits for confidentiality purposes...):

Schema

root
 |-- location_info: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- restaurant_type: string (nullable = true)
 |    |    |
 |    |    |
 |    |    |-- other_data: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- other_data_1: string (nullable = true)
 |    |    |    |    |-- other_data_2: string (nullable = true)
 |    |    |    |    |-- other_data_3: string (nullable = true)
 |    |    |    |    |-- other_data_4: string (nullable = true)
 |    |    |    |    |-- other_data_5: string (nullable = true)
 |    |    |
 |    |    |-- latitude: string (nullable = true)
 |    |    |
 |    |    |
 |    |    |
 |    |    |
 |    |    |
 |    |    |-- longitude: string (nullable = true)
 |    |    |
 |    |    |
 |    |    |
 |    |    |-- timezone: string (nullable = true)
 |-- restaurant_id: string (nullable = true)

My Goal

I would essentially want to get the data into the following data frame:

restaurant_id | latitude | longitude | timezone

What I Have Tried

The following code:

from pyspark.sql.functions import col, explode

dfj = spark.read.option("multiLine", False).json("/file/path")

result = dfj.select(col('restaurant_id'),
                    explode(col('location_info')).alias('location_info'))

# SQL operation
result.createOrReplaceTempView('result')

subset_data = spark.sql(
    '''
    SELECT restaurant_id, location_info.latitude, location_info.longitude, location_info.timestamp
    FROM result
    '''
).show()

# Also tried this to read in, splitting the raw text on "{" into separate records
source_df_1 = spark.read.json(sc.wholeTextFiles("/file/path")
                              .values()
                              .flatMap(lambda x: x
                                       .replace("{", "#!#")
                                       .split("#!#")))

But oddly enough, it gives me the following only for the first object / restaurant id:

+-------------+--------+---------+--------------------+
|restaurant_id|latitude|longitude|           timestamp|
+-------------+--------+---------+--------------------+
|           25|     2.0|     -8.0|2020-03-06T03:00:...|
|           25|     2.0|     -8.0|2020-03-06T03:00:...|
|           25|     2.0|     -8.0|2020-03-06T03:00:...|
|           25|     2.0|     -8.0|2020-03-06T03:01:...|
|           25|     2.0|     -8.0|2020-03-06T03:01:...|
+-------------+--------+---------+--------------------+

My research indicated that this may have something to do with the way the JSON files are structured at the source. For example:

{}{
}{
}

Thereby not being multi-line, or something along those lines. I'm wondering what to do about this as well.

Thank you very much for reading; any help would really be appreciated. I know I can always count on SO to be helpful.

Recommended Answer

The spark.read.json() reader assumes one JSON object per text line. I'm not sure I follow the insertion of the \n and then the split... it sounds like maybe the file is malformed?
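For illustration, here is a minimal sketch (the paths and records are made up for this example) of the line-delimited layout the default reader expects, versus reading a pretty-printed file with the multiLine option:

# Line-delimited ("JSON Lines"): one complete object per line -- the default expectation.
# records.jsonl:
#   {"restaurant_id": "25", "location_info": [{"latitude": "2.0", "longitude": "-8.0"}]}
#   {"restaurant_id": "26", "location_info": [{"latitude": "3.0", "longitude": "-9.0"}]}
df_lines = spark.read.json("/tmp/records.jsonl")  # hypothetical path

# If each object spans several lines (pretty-printed JSON), enable multiLine instead.
df_multi = spark.read.option("multiLine", True).json("/tmp/records_pretty.json")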

Perhaps there is a record separator such as \r which you can't see. The Linux command od -c <file name> | head -10 will help show what the characters are in between records.

If the schema is well known, then supply that schema object; this skips the first pass over the data that does schema inference, e.g. spark.read.schema(schema).json('path to directory'), and will definitely make your read operation much faster. Save the objects in Parquet or Delta Lake format for better performance if you need to query them later.
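As a rough sketch of that suggestion (field names are taken from the schema in the question, the redacted/unused fields are omitted, and the input/output paths are placeholders), defining and supplying an explicit schema could look like this:

from pyspark.sql.functions import col, explode
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Schema for one element of the location_info array (only the visible fields).
location_schema = StructType([
    StructField("restaurant_type", StringType(), True),
    StructField("latitude", StringType(), True),
    StructField("longitude", StringType(), True),
    StructField("timezone", StringType(), True),
])

schema = StructType([
    StructField("restaurant_id", StringType(), True),
    StructField("location_info", ArrayType(location_schema, True), True),
])

# Supplying the schema skips the inference pass over the data.
dfj = spark.read.schema(schema).json("/file/path")

# Flatten the array and pull out the target columns.
flat = (dfj
        .select(col("restaurant_id"), explode("location_info").alias("loc"))
        .select("restaurant_id", "loc.latitude", "loc.longitude", "loc.timezone"))

# Write out as Parquet for faster downstream queries (assumed output location).
flat.write.mode("overwrite").parquet("/output/path")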

Databricks' COPY INTO or the cloudFiles (Auto Loader) format will speed up ingestion / reduce latency: https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html
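For reference, a minimal Auto Loader sketch (Databricks-only; the input, checkpoint, and output paths are assumptions, and it reuses the schema object defined above):

# Databricks Auto Loader: incrementally ingest new JSON files as they arrive.
stream = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          .schema(schema)                       # schema defined earlier
          .load("/mnt/raw/restaurants/"))       # assumed input directory

(stream.writeStream
 .format("delta")
 .option("checkpointLocation", "/mnt/chk/restaurants/")  # assumed checkpoint path
 .start("/mnt/delta/restaurants/"))                      # assumed output path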

