Spark: How to parse an array of JSON objects using Spark


Question

I have a file with normal columns plus one column that contains a JSON string, shown below (picture also attached). Each row actually belongs to a column named Demo (not visible in the picture). The other columns were removed and are not visible in the picture because they are not of concern for now.

[{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]

Please do not change the format of the JSON, since it is exactly as above in the data file, except that everything is on one line.

Each row has one such object under a column, say JSON. The objects are all on one line, but in an array. I would like to parse this column using Spark and access the value of each object inside. Please help.

What I want is the value of the key "value". My objective is to extract the value of the "value" key from each JSON object into separate columns.

I tried using get_json_object. It works for the JSON string 1) below, but returns null for the JSON 2):

  1. {"key":"device_kind","value":"desktop"}
  2. [{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]

The code I tried is below.

val jsonDF1 = spark.range(1).selectExpr(""" '{"key":"device_kind","value":"desktop"}' as jsonString""")

jsonDF1.select(get_json_object(col("jsonString"), "$.value") as "device_kind").show(2)// prints desktop under column named device_kind

val jsonDF2 = spark.range(1).selectExpr(""" '[{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]' as jsonString""")

jsonDF2.select(get_json_object(col("jsonString"), "$.[0].value") as "device_kind").show(2)// print null but expected is desktop under column named device_kind
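As an aside, the JsonPath dialect that get_json_object understands appears to index a top-level array without a dot before the bracket, i.e. `$[0].value` rather than `$.[0].value`. As a sanity check of what that path should select, here is the same access in plain Python on the sample string (stdlib `json`, outside Spark):

```python
import json

# Sample array of {key, value} objects from the question.
s = ('[{"key":"device_kind","value":"desktop"},'
     '{"key":"country_code","value":"ID"},'
     '{"key":"device_platform","value":"windows"}]')

# JsonPath $[0].value means: first array element, then its "value" field.
first_value = json.loads(s)[0]["value"]
print(first_value)  # desktop
```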

Next I wanted to use from_json, but I am unable to figure out how to build a schema for an array of JSON objects. All the examples I find are of nested JSON objects, but nothing similar to the above JSON string.

I did find that in SparkR 2.2, from_json has a boolean parameter which, if set to true, will handle the above type of JSON string, i.e. an array of JSON objects, but that option is not available in Spark-Scala 2.3.3.

To be clear on input and expected output, it should be as below.

Input:

+------------------------------------------------------------------------+
|Demographics                                                            |
+------------------------------------------------------------------------+
|[[device_kind, desktop], [country_code, ID], [device_platform, windows]]|
|[[device_kind, mobile], [country_code, BE], [device_platform, android]] |
|[[device_kind, mobile], [country_code, QA], [device_platform, android]] |
+------------------------------------------------------------------------+

Expected output:

+------------------------------------------------------------------------+-----------+------------+---------------+
|Demographics                                                            |device_kind|country_code|device_platform|
+------------------------------------------------------------------------+-----------+------------+---------------+
|[[device_kind, desktop], [country_code, ID], [device_platform, windows]]|desktop    |ID          |windows        |
|[[device_kind, mobile], [country_code, BE], [device_platform, android]] |mobile     |BE          |android        |
|[[device_kind, mobile], [country_code, QA], [device_platform, android]] |mobile     |QA          |android        |
+------------------------------------------------------------------------+-----------+------------+---------------+

Answer

Aleh, thank you for the answer. It works fine. I did the solution in a slightly different way because I am using Spark 2.3.3.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Schema for the column: an array of {key, value} string structs.
val sch = ArrayType(StructType(Array(
  StructField("key", StringType, true),
  StructField("value", StringType, true)
)))

// Parse the JSON string column into the typed array.
val jsonDF3 = mdf.select(from_json(col("jsonString"), sch).alias("Demographics"))

// Pull each value out by its position in the array.
val jsonDF4 = jsonDF3.withColumn("device_kind", expr("Demographics[0].value"))
  .withColumn("country_code", expr("Demographics[1].value"))
  .withColumn("device_platform", expr("Demographics[2].value"))
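One caveat with indexing like Demographics[0].value: it assumes the key/value pairs appear in the same order in every row. If the order can vary, looking values up by key is safer (in Spark 2.4+ the built-in `map_from_entries` turns such an array of structs into a map, but it is not available in 2.3.3). A plain-Python sketch of the key-based lookup, just to show why it is order-independent:

```python
import json

rows = [
    '[{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]',
    # Same keys in a different order -- positional access would misassign these.
    '[{"key":"country_code","value":"BE"},{"key":"device_kind","value":"mobile"},{"key":"device_platform","value":"android"}]',
]
for r in rows:
    # Build a {key: value} dict so each field is found regardless of position.
    m = {o["key"]: o["value"] for o in json.loads(r)}
    print(m["device_kind"], m["country_code"], m["device_platform"])
```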
