SPARK: How to parse an array of JSON objects using Spark


Problem Description

I have a file with normal columns and one column that contains a JSON string, shown below (a picture is also attached). Each row actually belongs to a column named Demo (not visible in the pic). The other columns were removed and are not visible in the pic because they are not of concern for now.

[{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]

Please do not change the format of the JSON, since it appears in the data file exactly as above, except that everything is on one line.

Each row has one such object under a column, say JSON. The objects are all on one line, but in an array. I would like to parse this column using Spark and access the value of each object inside. Please help.

What I want is to get the value of the key "value". My objective is to extract the value of the "value" key from each JSON object into separate columns.
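The target extraction can be sketched outside Spark with Python's standard json module (the variable name here is illustrative only, not part of the actual pipeline):

```python
import json

# One row's JSON column: an array of {"key": ..., "value": ...} objects
demo = '[{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]'

# Pull the "value" field out of each object in the array
values = [obj["value"] for obj in json.loads(demo)]
print(values)  # ['desktop', 'ID', 'windows']
```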

I tried using get_json_object. It works for JSON string 1) below, but returns null for JSON string 2):

  1. {"key":"device_kind","value":"desktop"}
  2. [{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]

The code I tried is below:

val jsonDF1 = spark.range(1).selectExpr(""" '{"key":"device_kind","value":"desktop"}' as jsonString""")

jsonDF1.select(get_json_object(col("jsonString"), "$.value") as "device_kind").show(2) // prints desktop under a column named device_kind

val jsonDF2 = spark.range(1).selectExpr(""" '[{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]' as jsonString""")

jsonDF2.select(get_json_object(col("jsonString"), "$.[0].value") as "device_kind").show(2) // prints null, but expected is desktop under a column named device_kind
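(As a side note, Spark's get_json_object implements a JsonPath subset in which array indices are typically written without a dot before the bracket; the following path form is a sketch under that assumption, not a claim about the original poster's setup:)

```scala
// Sketch: "$[0].value" (no dot before the bracket) indexes the first
// object in the JSON array, assuming Spark's JsonPath subset
jsonDF2.select(get_json_object(col("jsonString"), "$[0].value") as "device_kind").show(2)
```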

Next I wanted to use from_json, but I am unable to figure out how to build a schema for an array of JSON objects. All the examples I can find are of nested JSON objects, but nothing similar to the above JSON string.

I did find that in SparkR 2.2, from_json has a boolean parameter which, if set to true, will handle the above type of JSON string (i.e. an array of JSON objects), but that option is not available in Spark-Scala 2.3.3.

To be clear on the input and expected output, they should be as below.

Input:

+------------------------------------------------------------------------+
|Demographics                                                            |
+------------------------------------------------------------------------+
|[[device_kind, desktop], [country_code, ID], [device_platform, windows]]|
|[[device_kind, mobile], [country_code, BE], [device_platform, android]] |
|[[device_kind, mobile], [country_code, QA], [device_platform, android]] |
+------------------------------------------------------------------------+

Expected output:

+------------------------------------------------------------------------+-----------+------------+---------------+
|Demographics                                                            |device_kind|country_code|device_platform|
+------------------------------------------------------------------------+-----------+------------+---------------+
|[[device_kind, desktop], [country_code, ID], [device_platform, windows]]|desktop    |ID          |windows        |
|[[device_kind, mobile], [country_code, BE], [device_platform, android]] |mobile     |BE          |android        |
|[[device_kind, mobile], [country_code, QA], [device_platform, android]] |mobile     |QA          |android        |
+------------------------------------------------------------------------+-----------+------------+---------------+

Recommended Answer

Aleh, thank you for the answer. It works fine. I did the solution in a slightly different way because I am using Spark 2.3.3.

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

// Schema for the array of {"key": ..., "value": ...} objects
val sch = ArrayType(StructType(Array(
  StructField("key", StringType, true),
  StructField("value", StringType, true)
)))

// Parse the JSON string column into an array of structs
val jsonDF3 = mdf.select(from_json(col("jsonString"), sch).alias("Demographics"))

// Pull each object's "value" field into its own column, by position
val jsonDF4 = jsonDF3.withColumn("device_kind", expr("Demographics[0].value"))
  .withColumn("country_code", expr("Demographics[1].value"))
  .withColumn("device_platform", expr("Demographics[2].value"))
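One caveat: indexing Demographics[0], Demographics[1], Demographics[2] by position assumes the three key/value objects always arrive in the same order in every row. A lookup by the "key" field is order-independent; the logic can be sketched in plain Python (illustrative names, not the Spark API):

```python
import json

rows = [
    '[{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]',
    # Same fields, deliberately shuffled order:
    '[{"key":"country_code","value":"BE"},{"key":"device_kind","value":"mobile"},{"key":"device_platform","value":"android"}]',
]

for raw in rows:
    kv = {obj["key"]: obj["value"] for obj in json.loads(raw)}
    # Keyed lookup is unaffected by the objects' position in the array
    print(kv["device_kind"], kv["country_code"], kv["device_platform"])
```

In Spark 2.4+ the same idea can be expressed with map_from_entries(Demographics)['device_kind'], but that function is not available in 2.3.3.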

