How to transform a JSON string with multiple keys from Spark data frame rows in PySpark?


Problem description

I'm looking for help with parsing a JSON string that has multiple keys into a JSON struct; see the required output below.

The answer below shows how to transform a JSON string with a single Id:

  • jstr1 = '{"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}]}'

How can I transform the thousands of Ids in jstr1 and jstr2, when the number of Ids per JSON string varies from string to string?

Current code:

jstr1 = """
        {"id_1": [{"a": 1, "b": 2}, {"a": 3, "b": 4}], 
        "id_2": [{"a": 5, "b": 6}, {"a": 7, "b": 8}]}
              """
jstr2 = """
        {"id_3": [{"a": 9, "b": 10}, {"a": 11, "b": 12}], 
         "id_4": [{"a": 12, "b": 14}, {"a": 15, "b": 16}],
         "id_5": [{"a": 17, "b": 18}, {"a": 19, "b": 10}]}
          """

schema = "map<string, array<struct<a:int,b:int>>>"

# Imports missing from the original snippet:
from pyspark.sql import Row
import pyspark.sql.functions as F

df = sqlContext.createDataFrame([Row(json=jstr1), Row(json=jstr2)]) \
    .withColumn('json', F.from_json(F.col('json'), schema))

# getItem(0) keeps only the first map entry of each row, so the other ids are dropped
output = df.withColumn("id", F.map_keys("json").getItem(0)) \
            .withColumn("json", F.map_values("json").getItem(0))
output.show(truncate=False)

Current output:

+-------------------+----+
|json               |id  |
+-------------------+----+
|[[1, 2], [3, 4]]   |id_1|
|[[9, 10], [11, 12]]|id_3|
+-------------------+----+
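The reason id_2, id_4 and id_5 are missing is that getItem(0) only pulls the first entry out of each row's map. A quick check (a minimal sketch reusing the df defined above; key order here follows the order in the JSON text) confirms that each row still holds all of its ids:

df.select(F.map_keys('json').alias('ids')).show(truncate=False)
# +------------------+
# |ids               |
# +------------------+
# |[id_1, id_2]      |
# |[id_3, id_4, id_5]|
# +------------------+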

Required output:

+----------------------+----+
|json                  |id  |
+----------------------+----+
|[[[1, 2], [3, 4]]]    |id_1|
|[[[5, 6], [7, 8]]]    |id_2|
|[[[9, 10], [11, 12]]] |id_3|
|[[[13, 14], [15, 16]]]|id_4|
|[[[17, 18], [19, 20]]]|id_5|
+----------------------+----+

# NOTE: there is a large number of Ids in each JSON string,
# so hard-coded getItem(0), getItem(1), ... is not a valid solution
                      ...
|[[[1000,1001], [10002,1003 ]]] | id_100000 |
+-------------------------------+-----------+ 

Answer

An explode of the map column will do the job:

import pyspark.sql.functions as F

# explode on a map column yields one row per (key, value) entry
df.select(F.explode('json').alias('id', 'json')).show()
+----+--------------------+
|  id|                json|
+----+--------------------+
|id_1|    [[1, 2], [3, 4]]|
|id_2|    [[5, 6], [7, 8]]|
|id_3| [[9, 10], [11, 12]]|
|id_4|[[12, 14], [15, 16]]|
|id_5|[[17, 18], [19, 10]]|
+----+--------------------+
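If you want the column order shown in the required output (json before id), a follow-up select reorders the exploded columns; a minimal sketch reusing the df from above:

df.select(F.explode('json').alias('id', 'json')) \
  .select('json', 'id') \
  .show(truncate=False)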

To get the other desired output from your previous question, you can explode one more time. This time you explode the array column that came from the map values:

df.select(
    F.explode('json').alias('id', 'json')   # one row per (id, array) map entry
).select(
    'id', F.explode('json').alias('json')   # one row per struct in the array
).select(
    'id', 'json.*'                          # flatten the struct into columns a and b
).show()
+----+---+---+
|  id|  a|  b|
+----+---+---+
|id_1|  1|  2|
|id_1|  3|  4|
|id_2|  5|  6|
|id_2|  7|  8|
|id_3|  9| 10|
|id_3| 11| 12|
|id_4| 12| 14|
|id_4| 15| 16|
|id_5| 17| 18|
|id_5| 19| 10|
+----+---+---+
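For reference, here is a self-contained version of the whole pipeline. It is a minimal sketch that assumes Spark >= 2.4 and uses a SparkSession as the entry point instead of the older sqlContext from the question:

from pyspark.sql import Row, SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

jstr1 = '{"id_1": [{"a": 1, "b": 2}], "id_2": [{"a": 5, "b": 6}]}'

# One map entry per id; each value is an array of structs
schema = "map<string, array<struct<a:int,b:int>>>"

df = (spark.createDataFrame([Row(json=jstr1)])
      .withColumn('json', F.from_json('json', schema)))

(df.select(F.explode('json').alias('id', 'json'))   # one row per id
   .select('id', F.explode('json').alias('json'))   # one row per struct
   .select('id', 'json.*')                          # flatten into a and b
   .show())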
