Spark DataFrame columns transform to Map type and List of Map type
Question
I have the DataFrame below and would appreciate help getting the output in the two formats that follow.
Input:
+----------+-----------+---------+
|customerId|transHeader|transLine|
+----------+-----------+---------+
|1001      |1001aa     |1001aa1  |
|1001      |1001aa     |1001aa2  |
|1001      |1001aa     |1001aa3  |
|1001      |1001aa     |1001aa4  |
|1002      |1002bb     |1002bb1  |
|1002      |1002bb     |1002bb2  |
|1002      |1002bb     |1002bb3  |
|1002      |1002bb     |1002bb4  |
|1003      |1003cc     |1003cc1  |
|1003      |1003cc     |1003cc2  |
|1003      |1003cc     |1003cc3  |
+----------+-----------+---------+
Expected output set 1:
customerId headerLineMapGroup
1001 Map(1001aa -> (1001aa1, 1001aa2, 1001aa3, 1001aa4))
1002 Map(1002bb -> (1002bb1, 1002bb2, 1002bb3, 1002bb4))
1003 Map(1003cc -> (1003cc1, 1003cc2, 1003cc3))
Expected output set 2:
customerId headerLineListOfMapGroup
1001 List[ Map(1001aa -> 1001aa1), Map(1001aa -> 1001aa2), Map(1001aa -> 1001aa3), Map(1001aa -> 1001aa4) ]
1002 List[ Map(1002bb -> 1002bb1), Map(1002bb -> 1002bb2), Map(1002bb -> 1002bb3), Map(1002bb -> 1002bb4) ]
1003 List[ Map(1003cc -> 1003cc1), Map(1003cc -> 1003cc2), Map(1003cc -> 1003cc3) ]
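For reference, the two target shapes can be sketched with plain Scala collections, without Spark. This is only an illustration of the intended structure, assuming the rows of one customer are already in memory as (transHeader, transLine) pairs; the name `rows` is made up for the example.

```scala
// Rows of customer 1001 as (transHeader, transLine) pairs, copied from the input table.
val rows: Seq[(String, String)] = Seq(
  ("1001aa", "1001aa1"),
  ("1001aa", "1001aa2"),
  ("1001aa", "1001aa3"),
  ("1001aa", "1001aa4")
)

// Output set 1: a single map from each header to all of its lines.
val headerLineMapGroup: Map[String, Seq[String]] =
  rows.groupBy(_._1).map { case (header, pairs) => header -> pairs.map(_._2) }

// Output set 2: a list holding one single-entry map per input row.
val headerLineListOfMapGroup: List[Map[String, String]] =
  rows.map { case (header, line) => Map(header -> line) }.toList
```

In Spark terms this would likely correspond to a map column with array values for set 1 and an array-of-maps column for set 2.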
Recommended answer
Here is a solution using UDFs.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, udf}

val spark = SparkSession
  .builder()
  .master("local")
  .appName("ParquetAppendMode")
  .getOrCreate()
import spark.implicits._

// Sample rows for a single customer
val data = spark.sparkContext.parallelize(Seq(
  (1001, "1001aa", "1001aa1"),
  (1001, "1001aa", "1001aa2"),
  (1001, "1001aa", "1001aa3")
)).toDF("customerId", "transHeader", "transLine")

// Wraps all lines of a header into a single Map(header -> lines)
val toMap = udf((header: String, line: Seq[String]) => {
  Map(header -> line)
})

// Builds one single-entry Map(header -> line) per line
val toMapList = udf((header: String, line: Seq[String]) => {
  line.map(l => Map(header -> l)).toList
})

// Collect the lines of each (customerId, transHeader) pair into an array column
val grouped = data.groupBy("customerId", "transHeader").agg(collect_list("transLine").alias("transLine"))

// Expected output set 1
grouped.withColumn("headerLineMapGroup", toMap($"transHeader", $"transLine"))
  .drop("transHeader", "transLine")
  .show(false)

// Expected output set 2
grouped.withColumn("headerLineMapGroupList", toMapList($"transHeader", $"transLine"))
  .drop("transHeader", "transLine")
  .show(false)
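As a possible alternative, the same two shapes can probably be produced without UDFs by using the built-in `map` and `collect_list` functions from `org.apache.spark.sql.functions`. The following is a minimal sketch rather than a tested replacement: it assumes Spark 2.x or later, and the app name `MapColumnsSketch` and the intermediate column name `lines` are made up for the example.

```scala
import org.apache.spark.sql.{SparkSession, functions => F}

val spark = SparkSession
  .builder()
  .master("local")
  .appName("MapColumnsSketch") // hypothetical app name
  .getOrCreate()
import spark.implicits._

val data = Seq(
  (1001, "1001aa", "1001aa1"),
  (1001, "1001aa", "1001aa2"),
  (1001, "1001aa", "1001aa3")
).toDF("customerId", "transHeader", "transLine")

// Output set 1: collect the lines per header, then pair the header with the collected array.
val mapGroup = data
  .groupBy("customerId", "transHeader")
  .agg(F.collect_list("transLine").alias("lines")) // "lines" is an illustrative name
  .select(
    F.col("customerId"),
    F.map(F.col("transHeader"), F.col("lines")).alias("headerLineMapGroup")
  )

// Output set 2: build one single-entry map per row, then collect the maps per customer.
val listOfMapGroup = data
  .groupBy("customerId")
  .agg(
    F.collect_list(F.map(F.col("transHeader"), F.col("transLine")))
      .alias("headerLineListOfMapGroup")
  )

mapGroup.show(false)
listOfMapGroup.show(false)
```

Note that the second query groups by customerId only, so a customer with several headers would get a single list mixing all of them, whereas the UDF version above yields one row per (customerId, transHeader) pair. Also, `collect_list` does not guarantee the order of the collected elements after a shuffle.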
Hope this helps!