Collect rows as list with group by apache spark
Question
I have a particular use case with multiple rows for the same customer, where each row object looks like:
root
-c1: BigInt
-c2: String
-c3: Double
-c4: Double
-c5: Map[String, Int]
Now I have to group by column c1 and collect all the rows as a list for the same customer, like:
c1, [Row1, Row3, Row4]
c2, [Row2, Row5]
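The desired grouping can be sketched with plain Scala collections, with a hypothetical CustomerRow case class standing in for the Spark row (no Spark needed for the illustration):

```scala
// Hypothetical case class mirroring the row schema from the question.
case class CustomerRow(c1: BigInt, c2: String, c3: Double, c4: Double, c5: Map[String, Int])

// Group by c1 and collect the full rows as lists per customer; this is the
// collection-level analogue of groupBy("c1").agg(collect_list(...)) in Spark.
def collectRowsByCustomer(rows: Seq[CustomerRow]): Map[BigInt, Seq[CustomerRow]] =
  rows.groupBy(_.c1)
```

Each map value keeps the rows in their original encounter order, just as collect_list keeps one entry per input row.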
I tried doing it this way:
dataset.withColumn("combined", array("c1","c2","c3","c4","c5")).groupBy("c1").agg(collect_list("combined"))
but I get an exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'array(`c1`, `c2`, `c3`, `c4`, `c5`)' due to data type mismatch: input to function array should all be the same type, but it's [bigint, string, double, double, map<string,map<string,double>>];;
Answer
Instead of array, you can use the struct function to combine the columns (struct allows fields of different types, whereas array requires all elements to share one type), then use groupBy with the collect_list aggregation function:
import org.apache.spark.sql.functions._
df.withColumn("combined", struct("c1","c2","c3","c4","c5"))
.groupBy("c1").agg(collect_list("combined").as("combined_list"))
.show(false)
so that you get the grouped dataset with schema as
root
|-- c1: integer (nullable = false)
|-- combined_list: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- c1: integer (nullable = false)
| | |-- c2: string (nullable = true)
| | |-- c3: string (nullable = true)
| | |-- c4: string (nullable = true)
| | |-- c5: map (nullable = true)
| | | |-- key: string
| | | |-- value: integer (valueContainsNull = false)
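Each element of combined_list is a struct whose fields remain accessible by name, so downstream per-customer processing stays straightforward. At the plain-collection level (again using a hypothetical CustomerRow stand-in, no Spark required), reading a field out of each collected element looks like:

```scala
// Hypothetical case class mirroring the struct elements of combined_list.
case class CustomerRow(c1: BigInt, c2: String, c3: Double, c4: Double, c5: Map[String, Int])

// Given rows already grouped per customer (the analogue of combined_list),
// read a single field from each collected element, e.g. sum c3 per customer.
def sumC3PerCustomer(grouped: Map[BigInt, Seq[CustomerRow]]): Map[BigInt, Double] =
  grouped.map { case (customer, rows) => customer -> rows.map(_.c3).sum }
```

In Spark itself the same field access would go through the struct, e.g. with explode plus col("combined.c3"), but the principle is the same: the struct preserves the named, typed fields that array would have flattened away.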
I hope the answer is helpful.