In PySpark how to parse an embedded JSON


Problem Description

I am new to PySpark.

I have a JSON file with the following schema:

df = spark.read.json(input_file)

df.printSchema()

root
 |-- UrlsInfo: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- displayUrl: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- url: string (nullable = true)
 |-- type: long (nullable = true)

I want a new result DataFrame that has only two columns: type and UrlsInfo.element.DisplayUrl.

This is my attempt, which doesn't give the expected output:

  df.createOrReplaceTempView("the_table")  
  resultDF = spark.sql("SELECT type, UrlsInfo.element.DisplayUrl FROM the_table")
  resultDF.show()

I want resultDF to be something like this:

Type | DisplayUrl
-----|-------------------
2    | http://example.com

This is related to JSON file parsing in Pyspark, but it doesn't answer my question.

Recommended Answer

As you can see in your schema, UrlsInfo is an array type, not a struct. The "element" item in the schema therefore refers not to a named property (which is what you're trying to access via .element) but to an array element (which responds to an index like [0]).
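
To illustrate the difference, here is a minimal DataFrame-API sketch, assuming df is the DataFrame you loaded from input_file (the displayUrl field name is taken from your printSchema output):

from pyspark.sql.functions import col

# Index into the array with [0], then read the struct field of that element;
# the "element" label from printSchema is not an addressable field.
firstUrlDF = df.select(col("type"), col("UrlsInfo")[0]["displayUrl"].alias("DisplayUrl"))
firstUrlDF.show(truncate=False)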

I've reproduced your schema by hand:

from pyspark.sql import Row

df = spark.createDataFrame([Row(
    UrlsInfo=[Row(displayUri="http://example.com", type="narf", url="poit")], Type=2)])
df.printSchema()

root
 |-- Type: long (nullable = true)
 |-- UrlsInfo: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- displayUri: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- url: string (nullable = true)

and I'm able to produce a table like what you seem to be looking for by using an index:

df.createOrReplaceTempView("temp")
resultDF = spark.sql("SELECT type, UrlsInfo[0].DisplayUri FROM temp")
resultDF.show()

+----+----------------------+
|type|UrlsInfo[0].DisplayUri|
+----+----------------------+
|   2|    http://example.com|
+----+----------------------+

However, this only gives the first element (if any) of UrlsInfo in the second column.

I'd forgotten about the EXPLODE function, which you can use here to treat the UrlsInfo elements like a set of rows:

from pyspark.sql import Row

df = spark.createDataFrame([Row(
    UrlsInfo=[Row(displayUri="http://example.com", type="narf", url="poit"),
              Row(displayUri="http://another-example.com", type="narf", url="poit")],
    Type=2)])
df.createOrReplaceTempView("temp")
resultDF = spark.sql("SELECT type, EXPLODE(UrlsInfo.displayUri) AS displayUri FROM temp")
resultDF.show()

+----+--------------------+
|type|          displayUri|
+----+--------------------+
|   2|  http://example.com|
|   2|http://another-ex...|
+----+--------------------+
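
For completeness, the same explode can be written with the DataFrame API instead of SQL. This is a sketch using pyspark.sql.functions.explode against the hand-built df above; explode_outer would additionally keep rows whose UrlsInfo array is empty or null:

from pyspark.sql.functions import col, explode

# One output row per UrlsInfo element, mirroring the SQL query above;
# UrlsInfo.displayUri extracts the displayUri field from each array element.
resultDF = df.select(
    col("Type").alias("type"),
    explode(col("UrlsInfo.displayUri")).alias("displayUri"))
resultDF.show()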
