Convert a Spark DataFrame column with an array of JSON objects to multiple rows

Views: 412 | Published: 2020/9/4 5:08:29 | Tags: apache-spark apache-spark-sql spark-streaming

Problem description

I have streaming JSON data whose structure can be described by the case class below:

case class Hello(A: String, B: Array[Map[String, String]])

Sample data for this structure looks as follows:

|  A    | B                                        |
|-------|------------------------------------------|
|  ABC  |  [{C:1, D:1}, {C:2, D:4}]                |
|  XYZ  |  [{C:3, D:6}, {C:9, D:11}, {C:5, D:12}]  |

I want to convert it to:

|   A   |  C  |  D   |
|-------|-----|------|
|  ABC  |  1  |  1   |
|  ABC  |  2  |  4   |
|  XYZ  |  3  |  6   |
|  XYZ  |  9  |  11  |
|  XYZ  |  5  |  12  |

Any help would be appreciated.

Recommended answer

Because the question evolved over time, I have left the original answer in place; this one addresses the final version of the question.

An important point: input of the following form is now catered for:

val df0 = Seq(
  ("ABC", List(Map("C" -> "1", "D" -> "2"), Map("C" -> "3", "D" -> "4"))),
  ("XYZ", List(Map("C" -> "44", "D" -> "55"), Map("C" -> "188", "D" -> "199"), Map("C" -> "88", "D" -> "99")))
).toDF("A", "B")
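Before exploding anything, it can help to confirm which column types Spark infers for this input. The following is a minimal sketch, assuming a local SparkSession (the `spark` value, the app name, and the names `typeOfA` and `typeOfB` are assumptions for illustration and not part of the original answer; in spark-shell the session already exists and the two session lines can be skipped). It shows that `B` is an array of maps, which is why the script further down explodes twice: once for the array and once for each map.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session created only for this illustration.
val spark = SparkSession.builder().master("local[*]").appName("schema-check").getOrCreate()
import spark.implicits._

val df0 = Seq(
  ("ABC", List(Map("C" -> "1", "D" -> "2"), Map("C" -> "3", "D" -> "4"))),
  ("XYZ", List(Map("C" -> "44", "D" -> "55"), Map("C" -> "188", "D" -> "199"), Map("C" -> "88", "D" -> "99")))
).toDF("A", "B")

// Compact, nullability-free description of each column's type.
val typeOfA = df0.schema("A").dataType.simpleString
val typeOfB = df0.schema("B").dataType.simpleString

println(s"A: $typeOfA, B: $typeOfB")
```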

It can also be done with input like this, but the script then needs to be modified, although the change is trivial:

val df0 = Seq(
  ("ABC", List(Map("C" -> "1",  "D" -> "2"))),
  ("ABC", List(Map("C" -> "44", "D" -> "55"))),
  ("XYZ", List(Map("C" -> "11", "D" -> "22")))
).toDF("A", "B")

Following on from the requested format, then:

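import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

// Explode the array: one row per map in B.
val df1 = df0.select($"A", explode($"B")).toDF("A", "Bn")

// Tag each map with a unique id so its keys can be regrouped later.
val df2 = df1.withColumn("SeqNum", monotonically_increasing_id())

// Explode each map into (key, value) rows; they are named B and C here.
val df3 = df2.select($"A", explode($"Bn"), $"SeqNum").toDF("A", "B", "C", "SeqNum")

// Build a grouping key that keeps A together with the per-map id.
val df4 = df3.withColumn("dummy", concat($"SeqNum", lit("||"), $"A"))

// Pivot the map keys back into columns, giving one row per original map.
val df5 = df4.select($"dummy", $"B", $"C").groupBy("dummy").pivot("B").agg(first($"C"))

// Recover A from the grouping key and drop the helper column.
val df6 = df5.withColumn("A", substring_index(col("dummy"), "||", -1)).drop("dummy")

df6.show(false)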

+---+---+---+
|C  |D  |A  |
+---+---+---+
|3  |4  |ABC|
|1  |2  |ABC|
|88 |99 |XYZ|
|188|199|XYZ|
|44 |55 |XYZ|
+---+---+---+

You can reorder the columns.
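As an alternative sketch, and not the method used in the answer above: when the keys of the maps are known in advance (assumed here to be `C` and `D`), a single `explode` followed by map lookups produces the columns directly in the order A, C, D, without the pivot step. The session setup and the names `entry`, `exploded`, and `result` are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

// Assumption: a local session created only for this illustration.
val spark = SparkSession.builder().master("local[*]").appName("explode-sketch").getOrCreate()
import spark.implicits._

val df0 = Seq(
  ("ABC", List(Map("C" -> "1", "D" -> "2"), Map("C" -> "3", "D" -> "4"))),
  ("XYZ", List(Map("C" -> "44", "D" -> "55"), Map("C" -> "188", "D" -> "199"), Map("C" -> "88", "D" -> "99")))
).toDF("A", "B")

// One output row per map in the array column B.
val exploded = df0.select($"A", explode($"B").as("entry"))

// Look up the assumed keys "C" and "D" in each map and name the results.
val result = exploded.select($"A", $"entry"("C").as("C"), $"entry"("D").as("D"))

result.show(false)
```

The explicit `select` at the end fixes the column order. The same kind of `select` with column names, for example `df6.select("A", "C", "D")`, would reorder the pivoted result as well.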
