SPARK: How to parse an Array of JSON objects using Spark
Question
I have a file with normal columns plus one column that contains a JSON string, shown below (picture also attached). Each row actually belongs to a column named Demo (not visible in the picture). The other columns are removed and not visible in the picture because they are not of concern for now.
[{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]
Please do not change the format of the JSON, since it appears exactly as above in the data file, except that everything is on one line.
Each row has one such object under the column, say JSON. The objects are all on one line, but inside an array. I would like to parse this column using Spark and access the value of each object inside. Please help.
What I want is to get the value of the key "value". My objective is to extract the value of the "value" key from each JSON object into separate columns.
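To make the goal concrete outside Spark: each row's array of {key, value} pairs should be pivoted into one column per key. A minimal sketch of that transformation with Python's standard json module (an illustration of the intent only, not the Spark code):

```python
import json

# One row of the Demo column: an array of {key, value} objects on a single line
row = ('[{"key":"device_kind","value":"desktop"},'
      '{"key":"country_code","value":"ID"},'
      '{"key":"device_platform","value":"windows"}]')

# Pivot the array into {key -> value}, i.e. one column per key
columns = {entry["key"]: entry["value"] for entry in json.loads(row)}
print(columns["device_kind"])   # desktop
print(columns["country_code"])  # ID
```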
I tried using get_json_object. It works for the single JSON string 1) below, but returns null for the JSON array 2):

- {"key":"device_kind","value":"desktop"}
- [{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]
The code I tried is below:
val jsonDF1 = spark.range(1).selectExpr(""" '{"key":"device_kind","value":"desktop"}' as jsonString""")
jsonDF1.select(get_json_object(col("jsonString"), "$.value") as "device_kind").show(2) // prints desktop under column named device_kind
val jsonDF2 = spark.range(1).selectExpr(""" '[{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]' as jsonString""")
jsonDF2.select(get_json_object(col("jsonString"), "$.[0].value") as "device_kind").show(2) // prints null, but desktop is expected under column named device_kind
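The null likely comes from the path expression: `$.value` (and `$.[0].value`) treats the root as an object, but here the root is an array, so an element must be indexed first; the form `$[0].value` is what I would expect to work in Spark, though that is an assumption to verify against your version. The equivalent mistake and fix, sketched in plain Python:

```python
import json

arr = ('[{"key":"device_kind","value":"desktop"},'
      '{"key":"country_code","value":"ID"}]')
data = json.loads(arr)

# data is a list, so data["value"] would raise TypeError -- the analogue of
# get_json_object returning null when an object-style path is applied to an
# array root. Indexing the element first, then accessing the field, works:
print(data[0]["value"])  # desktop
```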
Next I wanted to use from_json, but I am unable to figure out how to build a schema for an array of JSON objects. All the examples I can find are for nested JSON objects, but nothing similar to the above JSON string.
I did find that in SparkR 2.2, from_json has a boolean parameter that, if set to true, will handle the above type of JSON string, i.e. an array of JSON objects, but that option is not available in Spark-Scala 2.3.3.
To be clear on the input and expected output, they should be as below.
i/p below
+------------------------------------------------------------------------+
|Demographics |
+------------------------------------------------------------------------+
|[[device_kind, desktop], [country_code, ID], [device_platform, windows]]|
|[[device_kind, mobile], [country_code, BE], [device_platform, android]] |
|[[device_kind, mobile], [country_code, QA], [device_platform, android]] |
+------------------------------------------------------------------------+
expected o/p below
+------------------------------------------------------------------------+-----------+------------+---------------+
|Demographics |device_kind|country_code|device_platform|
+------------------------------------------------------------------------+-----------+------------+---------------+
|[[device_kind, desktop], [country_code, ID], [device_platform, windows]]|desktop |ID |windows |
|[[device_kind, mobile], [country_code, BE], [device_platform, android]] |mobile |BE |android |
|[[device_kind, mobile], [country_code, QA], [device_platform, android]] |mobile |QA |android |
+------------------------------------------------------------------------+-----------+------------+---------------+
Answer
Aleh, thank you for the answer. It works fine. I did the solution in a slightly different way because I am using Spark 2.3.3:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Schema for the column: an array of {key, value} string structs
val sch = ArrayType(StructType(Array(
  StructField("key", StringType, true),
  StructField("value", StringType, true)
)))

// Parse the JSON string column into an array of structs
val jsonDF3 = mdf.select(from_json(col("jsonString"), sch).alias("Demographics"))

// Pull each entry's "value" field out by position
val jsonDF4 = jsonDF3
  .withColumn("device_kind", expr("Demographics[0].value"))
  .withColumn("country_code", expr("Demographics[1].value"))
  .withColumn("device_platform", expr("Demographics[2].value"))
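One caveat with the positional approach above: Demographics[0] only holds device_kind if every row lists its keys in the same order. A more robust variant looks entries up by key name instead of position; sketched here in plain Python (in Spark 2.3.3 this would typically mean building a key-to-value map, e.g. via a UDF, since map_from_entries is not yet available -- an assumption about the approach, not the answer's code):

```python
import json

def to_columns(json_string, wanted_keys):
    """Build {key -> value} from the array, then look up by name,
    so column values do not depend on the order of the entries."""
    kv = {e["key"]: e["value"] for e in json.loads(json_string)}
    return {k: kv.get(k) for k in wanted_keys}

# Same shape of data but with the entries shuffled -- positional
# indexing would misassign these, a key lookup does not
row = ('[{"key":"country_code","value":"BE"},'
      '{"key":"device_kind","value":"mobile"},'
      '{"key":"device_platform","value":"android"}]')
print(to_columns(row, ["device_kind", "country_code", "device_platform"]))
```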