Convert a Spark dataframe column with an array of JSON objects to multiple rows
I have streaming JSON data whose structure can be described by the case class below:
case class Hello(A: String, B: Array[Map[String, String]])
Sample data looks like this:
| A | B |
|-------|------------------------------------------|
| ABC | [{C:1, D:1}, {C:2, D:4}] |
| XYZ   | [{C:3, D:6}, {C:9, D:11}, {C:5, D:12}]   |
I want to transform it to
| A | C | D |
|-------|-----|------|
| ABC | 1 | 1 |
| ABC | 2 | 4 |
| XYZ | 3 | 6 |
| XYZ | 9 | 11 |
| XYZ | 5 | 12 |
Any help will be appreciated.
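For reference, the target transform on plain Scala collections (no Spark) is just a flatMap over the inner maps; this sketch only illustrates the shape of the flattening, using the sample values from the question:

```scala
// Plain-Scala sketch (no Spark) of the requested flattening;
// names A, C, D follow the question.
val rows = Seq(
  ("ABC", Seq(Map("C" -> "1", "D" -> "1"), Map("C" -> "2", "D" -> "4"))),
  ("XYZ", Seq(Map("C" -> "3", "D" -> "6"), Map("C" -> "9", "D" -> "11"), Map("C" -> "5", "D" -> "12")))
)

// One output tuple (A, C, D) per inner map.
val flattened = rows.flatMap { case (a, maps) =>
  maps.map(m => (a, m("C"), m("D")))
}
flattened.foreach(println)
```

Doing the same thing on a distributed DataFrame is what the answer below works through.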
As the question evolved, I have left the original answer in place; this one addresses the final version of the question.
Important point: the input shown below is now catered for:
val df0 = Seq (
("ABC", List(Map("C" -> "1", "D" -> "2"), Map("C" -> "3", "D" -> "4"))),
("XYZ", List(Map("C" -> "44", "D" -> "55"), Map("C" -> "188", "D" -> "199"), Map("C" -> "88", "D" -> "99")))
)
.toDF("A", "B")
The input can also be expressed like this, but then the script needs a (trivial) modification:
val df0 = Seq (
("ABC", List(Map("C" -> "1", "D" -> "2"))),
("ABC", List(Map("C" -> "44", "D" -> "55"))),
("XYZ", List(Map("C" -> "11", "D" -> "22")))
)
.toDF("A", "B")
To get to the requested format:
// Explode the array so each map in B gets its own row.
val df1 = df0.select($"A", explode($"B")).toDF("A", "Bn")
// Tag each row with a unique id so rows can be regrouped after the second explode.
val df2 = df1.withColumn("SeqNum", monotonically_increasing_id()).toDF("A", "Bn", "SeqNum")
// Explode each map into (key, value) rows.
val df3 = df2.select($"A", explode($"Bn"), $"SeqNum").toDF("A", "B", "C", "SeqNum")
// Build a grouping key that keeps A recoverable after the pivot.
val df4 = df3.withColumn("dummy", concat($"SeqNum", lit("||"), $"A"))
// Pivot the map keys back into columns, one row per original map.
val df5 = df4.select($"dummy", $"B", $"C").groupBy("dummy").pivot("B").agg(first($"C"))
// Recover A from the grouping key, then drop the key.
val df6 = df5.withColumn("A", substring_index(col("dummy"), "||", -1)).drop("dummy")
df6.show(false)
returns:
+---+---+---+
|C |D |A |
+---+---+---+
|3 |4 |ABC|
|1 |2 |ABC|
|88 |99 |XYZ|
|188|199|XYZ|
|44 |55 |XYZ|
+---+---+---+
You can reorder the columns as needed.
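As an aside: if the map keys are fixed and known up front (here C and D), the pivot isn't strictly needed. A sketch of the shorter route, assuming the `df0` defined above and an active `SparkSession` named `spark`:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

// Explode the array, then read each known key straight out of the map column.
val flat = df0
  .select($"A", explode($"B").as("Bn"))
  .select($"A", $"Bn"("C").as("C"), $"Bn"("D").as("D"))
flat.show(false)
```

The pivot-based version above remains the general answer when the set of keys is not known in advance.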