How to transform a JSON string with multiple keys from Spark data frame rows in PySpark?


Problem description

I'm looking for help with parsing a JSON string that has multiple keys into a JSON struct; see the required output below.

The answer below shows how to transform a JSON string with a single Id:

  • jstr1 = '{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}]}'

How can I transform the thousands of Ids in jstr1 and jstr2, when the number of Ids per JSON string varies from string to string?

Current code:

jstr1 = """
        {"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}], 
        "id_2": [{"a": 5, "b": 6}, {"a": 7, "b": 8}]}
              """
jstr2 = """
        {"id_3": [{"a": 9, "b": 10}, {"a": 11, "b": 12}], 
         "id_4": [{"a": 12, "b": 14}, {"a": 15, "b": 16}],
         "id_5": [{"a": 17, "b": 18}, {"a": 19, "b": 10}]}
          """

schema = "map<string, array<struct<a:int,b:int>>>"

# Imports missing from the original snippet:
from pyspark.sql import Row
import pyspark.sql.functions as F

df = sqlContext.createDataFrame([Row(json=jstr1), Row(json=jstr2)]) \
    .withColumn('json', F.from_json(F.col('json'), schema))

# getItem(0) keeps only the first map entry of each row, so the other ids are dropped
output = df.withColumn("id", F.map_keys("json").getItem(0)) \
            .withColumn("json", F.map_values("json").getItem(0))
output.show(truncate=False)

Current output:

+-------------------+----+
|json               |id  |
+-------------------+----+
|[[1, 2], [3, 4]]   |id_1|
|[[9, 10], [11, 12]]|id_3|
+-------------------+----+
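The reason id_2, id_4 and id_5 are missing is that getItem(0) only pulls the first entry out of each row's map. A quick check (a minimal sketch reusing the df defined above; key order here follows the order in the JSON text) confirms that each row still holds all of its ids:

df.select(F.map_keys('json').alias('ids')).show(truncate=False)
# +------------------+
# |ids               |
# +------------------+
# |[id_1, id_2]      |
# |[id_3, id_4, id_5]|
# +------------------+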

Required output:

+----------------------+----+
|json                  |id  |
+----------------------+----+
|[[[1, 2], [3, 4]]]    |id_1|
|[[[5, 6], [7, 8]]]    |id_2|
|[[[9, 10], [11, 12]]] |id_3|
|[[[13, 14], [15, 16]]]|id_4|
|[[[17, 18], [19, 20]]]|id_5|
+----------------------+----+

# NOTE: there is a large number of Ids in each JSON string,
# so hard-coded getItem(0), getItem(1), ... is not a valid solution
                      ...
|[[[1000,1001], [10002,1003 ]]] | id_100000 |
+-------------------------------+-----------+ 

Answer

An explode of the map column will do the job:

import pyspark.sql.functions as F

# explode on a map column yields one row per (key, value) entry
df.select(F.explode('json').alias('id', 'json')).show()
+----+--------------------+
|  id|                json|
+----+--------------------+
|id_1|    [[1, 2], [3, 4]]|
|id_2|    [[5, 6], [7, 8]]|
|id_3| [[9, 10], [11, 12]]|
|id_4|[[12, 14], [15, 16]]|
|id_5|[[17, 18], [19, 10]]|
+----+--------------------+
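If you want the column order shown in the required output (json before id), a follow-up select reorders the exploded columns; a minimal sketch reusing the df from above:

df.select(F.explode('json').alias('id', 'json')) \
  .select('json', 'id') \
  .show(truncate=False)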

To get the other desired output from your previous question, you can explode one more time. This time you explode the array column that came from the map values:

df.select(
    F.explode('json').alias('id', 'json')   # one row per (id, array) map entry
).select(
    'id', F.explode('json').alias('json')   # one row per struct in the array
).select(
    'id', 'json.*'                          # flatten the struct into columns a and b
).show()
+----+---+---+
|  id|  a|  b|
+----+---+---+
|id_1|  1|  2|
|id_1|  3|  4|
|id_2|  5|  6|
|id_2|  7|  8|
|id_3|  9| 10|
|id_3| 11| 12|
|id_4| 12| 14|
|id_4| 15| 16|
|id_5| 17| 18|
|id_5| 19| 10|
+----+---+---+
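For reference, here is a self-contained version of the whole pipeline. It is a minimal sketch that assumes Spark >= 2.4 and uses a SparkSession as the entry point instead of the older sqlContext from the question:

from pyspark.sql import Row, SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

jstr1 = '{"id_1": [{"a": 1, "b": 2}], "id_2": [{"a": 5, "b": 6}]}'

# One map entry per id; each value is an array of structs
schema = "map<string, array<struct<a:int,b:int>>>"

df = (spark.createDataFrame([Row(json=jstr1)])
      .withColumn('json', F.from_json('json', schema)))

(df.select(F.explode('json').alias('id', 'json'))   # one row per id
   .select('id', F.explode('json').alias('json'))   # one row per struct
   .select('id', 'json.*')                          # flatten into a and b
   .show())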
