Spark: How to parse JSON string of nested lists to spark data frame?
Question
How to parse a JSON string of nested lists into a Spark data frame in pyspark?
Input data frame:
+-------------+-----------------------------------------------+
|url |json |
+-------------+-----------------------------------------------+
|https://url.a|[[1572393600000, 1.000],[1572480000000, 1.007]]|
|https://url.b|[[1572825600000, 1.002],[1572912000000, 1.000]]|
+-------------+-----------------------------------------------+
root
|-- url: string (nullable = true)
|-- json: string (nullable = true)
Expected output:
+-----+-------------+-----+
|col_1|col_2        |col_3|
+-----+-------------+-----+
|a    |1572393600000|1.000|
|a    |1572480000000|1.007|
|b    |1572825600000|1.002|
|b    |1572912000000|1.000|
+-----+-------------+-----+
Sample code:
import pyspark
import pyspark.sql.functions as F

spark = (pyspark.sql.SparkSession.builder
         .appName("Downloader_standalone")
         .master('local[*]')
         .getOrCreate())
sc = spark.sparkContext

rdd_list = [('https://url.a', '[[1572393600000, 1.000],[1572480000000, 1.007]]'),
            ('https://url.b', '[[1572825600000, 1.002],[1572912000000, 1.000]]')]
jsons = sc.parallelize(rdd_list)
df = spark.createDataFrame(jsons, "url string, json string")
df.show(truncate=False)
df.printSchema()

# This attempt fails: "array<string,string>" is not a valid schema string
(df.withColumn('json', F.from_json(F.col('json'), "array<string,string>"))
 .select(F.explode('json').alias('col_1', 'col_2', 'col_3')).show())
There are a few examples, but I can not figure out how to do it:
How to transform a JSON string with multiple keys from Spark data frame rows in pyspark?
Answer
With some replacements in the strings and by splitting, you can get the desired result:
from pyspark.sql import functions as F

df1 = df.withColumn(
    "col_1",
    F.regexp_replace("url", "https://url.", "")
).withColumn(
    "col_2_3",
    F.explode(
        F.expr("""transform(
            split(trim(both '][' from json), '\\\],\\\['),
            x -> struct(split(x, ',')[0] as col_2, split(x, ',')[1] as col_3)
        )""")
    )
).selectExpr("col_1", "col_2_3.*")
df1.show(truncate=False)
#+-----+-------------+------+
#|col_1|col_2 |col_3 |
#+-----+-------------+------+
#|a |1572393600000| 1.000|
#|a |1572480000000| 1.007|
#|b |1572825600000| 1.002|
#|b |1572912000000| 1.000|
#+-----+-------------+------+
Explanation:
trim(both '][' from json): removes the trailing and leading characters [ and ], giving something like: 1572393600000, 1.000],[1572480000000, 1.007
Now you can split by ],[ (the \\\ is for escaping the brackets, since split takes a regex).
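The trim and split steps can be traced in plain Python to see the intermediate strings (a toy illustration of the same string surgery, not Spark code):

```python
s = "[[1572393600000, 1.000],[1572480000000, 1.007]]"

# trim(both '][' from json): strip leading/trailing '[' and ']' characters
trimmed = s.strip("][")
# -> "1572393600000, 1.000],[1572480000000, 1.007"

# split on the "],[" separator that sits between the inner lists
pairs = trimmed.split("],[")
# -> ["1572393600000, 1.000", "1572480000000, 1.007"]
```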
transform takes the array from the split and, for each element, splits it by the comma and creates a struct with fields col_2 and col_3.
Finally, explode the array of structs you get from the transform and star-expand the struct column.