In PySpark how to parse an embedded JSON
Question
I am new to PySpark.
I have a JSON file which has the below schema:
df = spark.read.json(input_file)
df.printSchema()
root
|-- UrlsInfo: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- displayUrl: string (nullable = true)
| | |-- type: string (nullable = true)
| | |-- url: string (nullable = true)
|-- type: long (nullable = true)
I want a new result dataframe which should have only two columns, type and UrlsInfo.element.DisplayUrl.
This is my attempt, which doesn't give the expected output:
df.createOrReplaceTempView("the_table")
resultDF = spark.sql("SELECT type, UrlsInfo.element.DisplayUrl FROM the_table")
resultDF.show()
I want resultDF to be something like this:
Type | DisplayUrl
---- | ------------------
2    | http://example.com
This is related to JSON file parsing in Pyspark, but it doesn't answer my question.
Answer
As you can see in your schema, UrlsInfo is an array type, not a struct. The "element" schema item thus refers not to a named property (you're trying to access it by .element) but to an array element (which responds to an index like [0]).
I've reproduced your schema by hand:
from pyspark.sql import Row
df = spark.createDataFrame([Row(UrlsInfo=[Row(displayUri="http://example.com", type="narf", url="poit")], Type=2)])
df.printSchema()
root
|-- Type: long (nullable = true)
|-- UrlsInfo: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- displayUri: string (nullable = true)
| | |-- type: string (nullable = true)
| | |-- url: string (nullable = true)
and I'm able to produce a table like what you seem to be looking for by using an index:
df.createOrReplaceTempView("temp")
resultDF = spark.sql("SELECT type, UrlsInfo[0].DisplayUri FROM temp")
resultDF.show()
+----+----------------------+
|type|UrlsInfo[0].DisplayUri|
+----+----------------------+
| 2| http://example.com|
+----+----------------------+
However, this only gives the first element (if any) of UrlsInfo in the second column.
I'd forgotten about the EXPLODE function, which you can use here to treat the UrlsInfo elements like a set of rows:
from pyspark.sql import Row
df = spark.createDataFrame([Row(UrlsInfo=[Row(displayUri="http://example.com", type="narf", url="poit"), Row(displayUri="http://another-example.com", type="narf", url="poit")], Type=2)])
df.createOrReplaceTempView("temp")
resultDF = spark.sql("SELECT type, EXPLODE(UrlsInfo.displayUri) AS displayUri FROM temp")
resultDF.show()
+----+--------------------+
|type| displayUri|
+----+--------------------+
| 2| http://example.com|
| 2|http://another-ex...|
+----+--------------------+