dataframe Spark scala explode json array


Question

Let's say I have a dataframe which looks like this:

+--------------------+--------------------+--------------------------------------------------------------+
|                id  |           Name     |                                                       Payment|
+--------------------+--------------------+--------------------------------------------------------------+
|                1   |           James    |[ {"@id": 1, "currency":"GBP"},{"@id": 2, "currency": "USD"} ]|
+--------------------+--------------------+--------------------------------------------------------------+

The schema is:

|-- id: integer (nullable = true)
|-- Name: string (nullable = true)   
|-- Payment: string (nullable = true)

How can I explode the above JSON array into below:

+--------------------+--------------------+-------------------------------+
|                id  |           Name     |                        Payment|
+--------------------+--------------------+-------------------------------+
|                1   |           James    |   {"@id":1, "currency":"GBP"} |
+--------------------+--------------------+-------------------------------+
|                1   |           James    |   {"@id":2, "currency":"USD"} |
+--------------------+--------------------+-------------------------------+

I've been trying to use the explode functionality like the below, but it's not working. It's giving an error about not being able to explode string types, and that it expects either a map or array. This makes sense given the schema denotes it's a string, rather than an array/map, but I'm not sure how to convert this into an appropriate format.

val newDF = dataframe.withColumn("nestedPayment", explode(dataframe.col("Payment")))

Any help is greatly appreciated!

Answer

You'll have to parse the JSON string into an array of JSONs, and then use explode on the result (explode expects an array).

To do that (assuming Spark 2.0.*):

  • If you know all Payment values contain a json representing an array with the same size (e.g. 2 in this case), you can hard-code extraction of the first and second elements, wrap them in an array and explode:

import org.apache.spark.sql.functions.{array, explode, get_json_object}
import spark.implicits._ // for the $"..." column syntax (automatic in spark-shell)

val newDF = dataframe.withColumn("Payment", explode(array(
  get_json_object($"Payment", "$[0]"),
  get_json_object($"Payment", "$[1]")
)))

  • If you can't guarantee all records have a JSON with a 2-element array, but you can guarantee a maximum length of these arrays, you can use this trick to parse elements up to the maximum size and then filter out the resulting nulls:

    val maxJsonParts = 3 // whatever that number is...
    val jsonElements = (0 until maxJsonParts)
                         .map(i => get_json_object($"Payment", s"$$[$i]"))
    
    val newDF = dataframe
      .withColumn("Payment", explode(array(jsonElements: _*)))
      .where(!isnull($"Payment")) 
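
The pad-to-maximum-then-drop-nulls trick above can be illustrated outside Spark. In this sketch, `extractAt` is a hypothetical stand-in for `get_json_object($"Payment", s"$$[$i]")`: it returns the i-th element of a flat JSON array string, or null past the end. The splitter is deliberately simplistic and only handles flat objects like the Payment records here:

```scala
// Illustrative sketch only (plain Scala, no Spark). splitTopLevel is a
// deliberately naive JSON-array splitter that only copes with flat objects
// such as {"@id": 1, "currency": "GBP"} -- real code should use a JSON parser.
def splitTopLevel(jsonArray: String): Vector[String] =
  jsonArray.trim.stripPrefix("[").stripSuffix("]")
    .split("(?<=\\})\\s*,\\s*(?=\\{)") // split on the comma between "}" and "{"
    .map(_.trim)
    .filter(_.nonEmpty)
    .toVector

// Stand-in for get_json_object($"Payment", s"$$[$i]"):
// the i-th array element, or null when i is past the end.
def extractAt(jsonArray: String, i: Int): String = {
  val items = splitTopLevel(jsonArray)
  if (i < items.length) items(i) else null
}

val payment = """[ {"@id": 1, "currency":"GBP"},{"@id": 2, "currency": "USD"} ]"""

val maxJsonParts = 3                 // pad up to the assumed maximum...
val exploded = (0 until maxJsonParts)
  .map(i => extractAt(payment, i))
  .filter(_ != null)                 // ...then drop the null padding
```

With the sample Payment string this yields the two per-currency JSON objects, mirroring the two output rows in the question.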
    
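As a side note (not part of the approach above): on newer Spark versions (2.2+, where from_json accepts an array schema), a schema-based alternative to repeated get_json_object calls is from_json plus explode. The schema below is an assumption matching the sample Payment objects; adjust field names and types to your data:

```scala
import org.apache.spark.sql.functions.{explode, from_json}
import org.apache.spark.sql.types.{ArrayType, IntegerType, StringType, StructField, StructType}

// Assumed element schema for the Payment array; "@id" is the literal JSON key.
val paymentSchema = ArrayType(StructType(Seq(
  StructField("@id", IntegerType),
  StructField("currency", StringType)
)))

// from_json parses the whole string into an array of structs in one pass,
// so explode then yields one struct per row.
val newDF = dataframe
  .withColumn("Payment", explode(from_json($"Payment", paymentSchema)))
```

Note that the resulting Payment column is a struct rather than a JSON string, unlike the get_json_object version, so downstream code can read fields directly (e.g. `$"Payment.currency"`).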

