Spark Scala: explode a JSON array in a dataframe


Problem description

Let's say I have a dataframe which looks like this:

+--------------------+--------------------+--------------------------------------------------------------+
|                id  |           Name     |                                                       Payment|
+--------------------+--------------------+--------------------------------------------------------------+
|                1   |           James    |[ {"@id": 1, "currency":"GBP"},{"@id": 2, "currency": "USD"} ]|
+--------------------+--------------------+--------------------------------------------------------------+

The schema is:

|-- id: integer (nullable = true)
|-- Name: string (nullable = true)   
|-- Payment: string (nullable = true)
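
For reference, here is a minimal sketch that reproduces a dataframe of this shape (the SparkSession setup and names are assumptions for illustration, not part of the original question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("explode-json-example").getOrCreate()
import spark.implicits._

// Payment is deliberately a plain string column holding a JSON array.
val dataframe = Seq(
  (1, "James", """[ {"@id": 1, "currency":"GBP"},{"@id": 2, "currency": "USD"} ]""")
).toDF("id", "Name", "Payment")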

How can I explode the above JSON array into the following:

+--------------------+--------------------+-------------------------------+
|                id  |           Name     |                        Payment|
+--------------------+--------------------+-------------------------------+
|                1   |           James    |   {"@id":1, "currency":"GBP"} |
+--------------------+--------------------+-------------------------------+
|                1   |           James    |   {"@id":2, "currency":"USD"} |
+--------------------+--------------------+-------------------------------+

I've been trying to use the explode function as below, but it isn't working: it throws an error about not being able to explode string types, expecting either a map or an array. That makes sense given the schema says the column is a string rather than an array/map, but I'm not sure how to convert it into an appropriate format.

val newDF = dataframe.withColumn("nestedPayment", explode(dataframe.col("Payment")))

Any help would be greatly appreciated!

Recommended answer

You'll have to parse the JSON string into an array of JSONs, and then use explode on the result (explode expects an array).

To do this (assuming Spark version 2.0.*):

  • If you know all Payment values contain a JSON array of the same size (e.g. 2 in this case), you can hard-code the extraction of the first and second elements, wrap them in an array, and explode:

import org.apache.spark.sql.functions.{array, explode, get_json_object}
import spark.implicits._ // provides the $"column" syntax

// Extract the first two elements, wrap them in an array column, and explode.
val newDF = dataframe.withColumn("Payment", explode(array(
  get_json_object($"Payment", "$[0]"),
  get_json_object($"Payment", "$[1]")
)))
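
Each value produced this way is still a JSON string, matching the desired output above. Note that get_json_object returns null for an index past the end of the array, so records with fewer elements would produce null rows here; the next approach handles that explicitly.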

  • If you can't guarantee all records have a JSON with a 2-element array, but you can guarantee a maximum length for these arrays, you can use this trick to parse elements up to the maximum size and then filter out the resulting nulls:

    val maxJsonParts = 3 // whatever that number is...

    // Indices past the end of a given array come back as null from
    // get_json_object; the where clause below filters those rows out.
    // (isnull is also imported from org.apache.spark.sql.functions.)
    val jsonElements = (0 until maxJsonParts)
                         .map(i => get_json_object($"Payment", s"$$[$i]"))

    val newDF = dataframe
      .withColumn("Payment", explode(array(jsonElements: _*)))
      .where(!isnull($"Payment"))
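
As a side note: on newer Spark versions (2.4 and later), the hard-coded indexing can be avoided by parsing the column with from_json and an explicit array schema. The sketch below is an alternative based on the standard API rather than part of the original answer, and it yields struct values instead of JSON strings:

import org.apache.spark.sql.functions.{explode, from_json}
import org.apache.spark.sql.types._

// Element schema inferred from the sample data ("@id" and "currency").
val paymentSchema = ArrayType(StructType(Seq(
  StructField("@id", IntegerType),
  StructField("currency", StringType)
)))

val structDF = dataframe
  .withColumn("Payment", explode(from_json($"Payment", paymentSchema)))

Since Payment is then a struct, individual fields can be selected directly, e.g. structDF.select($"Payment.currency").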
    

