如何使用 PySpark 从现有临时表中解析 json 字符串? [英] How can you parse a string that is json from an existing temp table using PySpark?

查看：35 发布时间：2021/11/14 21:56:49 apache-spark pyspark spark-dataframe

本文介绍了如何使用 PySpark 从现有临时表中解析 json 字符串?的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个现有的 Spark 数据框，其中包含这样的列:

I have an existing Spark dataframe that has columns as such:

--------------------
pid | response
--------------------
 12 | {"status":"200"}

响应是一个字符串列.有没有办法将其转换为 JSON 并提取特定字段?可以像在 Hive 中一样使用横向视图吗?我在网上查找了一些使用爆炸和稍后查看的示例，但它似乎不适用于 Spark 2.1.1

response is a string column. Is there a way to cast it into JSON and extract specific fields? Can lateral view be used as it is in Hive? I looked up some examples on line that used explode and later view but it doesn't seem to work with Spark 2.1.1

推荐答案

从 pyspark.sql.functions 中，您可以使用 from_json,get_json_object,json_tuple 中的任何一个从 json 字符串中提取字段，如下所示，

From pyspark.sql.functions , you can use any of from_json,get_json_object,json_tuple to extract fields from json string as below,

>>from pyspark.sql.functions import json_tuple,from_json,get_json_object
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> l = [(12, '{"status":"200"}'),(13,'{"status":"200","somecol":"300"}')]
>>> df = spark.createDataFrame(l,['pid','response'])
>>> df.show()
+---+--------------------+
|pid|            response|
+---+--------------------+
| 12|    {"status":"200"}|
| 13|{"status":"200",...|
+---+--------------------+

>>> df.printSchema()
root
 |-- pid: long (nullable = true)
 |-- response: string (nullable = true)

Using json_tuple :
>>> df.select('pid',json_tuple(df.response,'status','somecol')).show()
+---+---+----+
|pid| c0|  c1|
+---+---+----+
| 12|200|null|
| 13|200| 300|
+---+---+----+

Using from_json:
>>> schema = StructType([StructField("status", StringType()),StructField("somecol", StringType())])
>>> df.select('pid',from_json(df.response, schema).alias("json")).show()
+---+----------+
|pid|      json|
+---+----------+
| 12|[200,null]|
| 13| [200,300]|
+---+----------+

Using get_json_object:
>>> df.select('pid',get_json_object(df.response,'$.status').alias('status'),get_json_object(df.response,'$.somecol').alias('somecol')).show()
+---+------+-------+
|pid|status|somecol|
+---+------+-------+
| 12|   200|   null|
| 13|   200|    300|
+---+------+-------+

这篇关于如何使用 PySpark 从现有临时表中解析 json 字符串?的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

如何使用 PySpark 从现有临时表中解析 json 字符串? [英] How can you parse a string that is json from an existing temp table using PySpark?

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

如何使用 PySpark 从现有临时表中解析 json 字符串? [英] How can you parse a string that is json from an existing temp table using PySpark?

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭