Spark: How to parse JSON string of nested lists to spark data frame?
Question
How to parse a JSON string of nested lists into a Spark data frame in pyspark?
Input data frame:
+-------------+-----------------------------------------------+
|url |json |
+-------------+-----------------------------------------------+
|https://url.a|[[1572393600000, 1.000],[1572480000000, 1.007]]|
|https://url.b|[[1572825600000, 1.002],[1572912000000, 1.000]]|
+-------------+-----------------------------------------------+
root
|-- url: string (nullable = true)
|-- json: string (nullable = true)
Expected output:
+-----+-------------+-----+
|col_1|col_2        |col_3|
+-----+-------------+-----+
|a    |1572393600000|1.000|
|a    |1572480000000|1.007|
|b    |1572825600000|1.002|
|b    |1572912000000|1.000|
+-----+-------------+-----+
Sample code:
import pyspark
import pyspark.sql.functions as F

spark = (pyspark.sql.SparkSession.builder
         .appName("Downloader_standalone")
         .master('local[*]')
         .getOrCreate())
sc = spark.sparkContext

rdd_list = [('https://url.a', '[[1572393600000, 1.000],[1572480000000, 1.007]]'),
            ('https://url.b', '[[1572825600000, 1.002],[1572912000000, 1.000]]')]
jsons = sc.parallelize(rdd_list)
df = spark.createDataFrame(jsons, "url string, json string")
df.show(truncate=False)
df.printSchema()

# This attempt fails: "array<string,string>" is not a valid schema string
(df.withColumn('json', F.from_json(F.col('json'), "array<string,string>"))
 .select(F.explode('json').alias('col_1', 'col_2', 'col_3')).show())
There are a few examples, but I can not figure out how to do it:
How to transform a JSON string with multiple keys from Spark data frame rows in pyspark?
Answer
With some replacements in the strings and by splitting, you can get the desired result:
from pyspark.sql import functions as F

df1 = df.withColumn(
    "col_1",
    F.regexp_replace("url", "https://url.", "")
).withColumn(
    "col_2_3",
    F.explode(
        F.expr("""transform(
            split(trim(both '][' from json), '\\\],\\\['),
            x -> struct(split(x, ',')[0] as col_2, split(x, ',')[1] as col_3)
        )""")
    )
).selectExpr("col_1", "col_2_3.*")
df1.show(truncate=False)
#+-----+-------------+------+
#|col_1|col_2 |col_3 |
#+-----+-------------+------+
#|a |1572393600000| 1.000|
#|a |1572480000000| 1.007|
#|b |1572825600000| 1.002|
#|b |1572912000000| 1.000|
#+-----+-------------+------+
Explanation:
trim(both '][' from json): removes the trailing and leading characters [ and ], giving something like: 1572393600000, 1.000],[1572480000000, 1.007
Now you can split by ],[ (the \\\ is for escaping the brackets, since split takes a regex).
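The trim and split steps can be traced in plain Python to see the intermediate strings (a toy illustration of the same string surgery, not Spark code):

```python
s = "[[1572393600000, 1.000],[1572480000000, 1.007]]"

# trim(both '][' from json): strip leading/trailing '[' and ']' characters
trimmed = s.strip("][")
# -> "1572393600000, 1.000],[1572480000000, 1.007"

# split on the "],[" separator that sits between the inner lists
pairs = trimmed.split("],[")
# -> ["1572393600000, 1.000", "1572480000000, 1.007"]
```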
transform takes the array from the split and, for each element, splits it by the comma and creates a struct with fields col_2 and col_3.
Finally, explode the array of structs you get from the transform and star-expand the struct column.