Spark dataframe: Pivot and Group based on columns
Question
I have an input dataframe as below, with id, app, and customer:
Input dataframe
+--------------------+-----+---------+
| id|app |customer |
+--------------------+-----+---------+
|id1 | fw| WM |
|id1 | fw| CS |
|id2 | fw| CS |
|id1 | fe| WM |
|id3 | bc| TR |
|id3 | bc| WM |
+--------------------+-----+---------+
Expected output
Using pivot and aggregate - make the app values column names and put the aggregated customer names as a list in the dataframe.
Expected dataframe
+--------------------+----------+-------+----------+
| id| bc | fe| fw |
+--------------------+----------+-------+----------+
|id1 | 0 | WM| [WM,CS]|
|id2 | 0 | 0| [CS] |
|id3 | [TR,WM] | 0| 0 |
+--------------------+----------+-------+----------+
What have I tried?
val newDF = df.groupBy("id").pivot("app").agg(expr("coalesce(first(customer),0)")).drop("app").show()
+--------------------+-----+-------+------+
| id|bc | fe| fw|
+--------------------+-----+-------+------+
|id1 | 0 | WM| WM|
|id2 | 0 | 0| CS|
|id3 | TR | 0| 0|
+--------------------+-----+-------+------+
Issue: With my query, I am not able to get a list of customers like [WM,CS] for "id1" under "fw" (as shown in the expected output); only "WM" comes through. Similarly, for "id3" only "TR" appears, whereas a list with the value [TR,WM] should appear under "bc" for "id3".
I need your suggestion on how to get the list of customers under each app.
Answer
You can use collect_list if you can live with an empty list in the cells where it should be zero:
import org.apache.spark.sql.functions.collect_list
df.groupBy("id").pivot("app").agg(collect_list("customer")).show
+---+--------+----+--------+
| id| bc| fe| fw|
+---+--------+----+--------+
|id3|[TR, WM]| []| []|
|id1| []|[WM]|[CS, WM]|
|id2| []| []| [CS]|
+---+--------+----+--------+
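For intuition, here is a minimal plain-Python sketch (not Spark code) of what groupBy("id").pivot("app").agg(collect_list("customer")) computes on the example rows above: group by id, fan the app values out into columns, and collect each group's customers into a list, with an empty list where no row exists. Note that in a real distributed Spark job the order of elements produced by collect_list is not guaranteed.

```python
from collections import defaultdict

# The example input rows from the question: (id, app, customer)
rows = [
    ("id1", "fw", "WM"),
    ("id1", "fw", "CS"),
    ("id2", "fw", "CS"),
    ("id1", "fe", "WM"),
    ("id3", "bc", "TR"),
    ("id3", "bc", "WM"),
]

# Group by id, then by app, collecting customers into lists (collect_list).
grouped = defaultdict(lambda: defaultdict(list))
for id_, app, customer in rows:
    grouped[id_][app].append(customer)

# Pivot: one column per distinct app value, empty list where no data exists.
apps = sorted({app for _, app, _ in rows})  # -> ["bc", "fe", "fw"]
pivoted = {
    id_: {app: cells.get(app, []) for app in apps}
    for id_, cells in grouped.items()
}

print(pivoted["id1"])  # {'bc': [], 'fe': ['WM'], 'fw': ['WM', 'CS']}
print(pivoted["id3"])  # {'bc': ['TR', 'WM'], 'fe': [], 'fw': []}
```

This mirrors the table above: empty cells come out as [] rather than 0, which is the trade-off the answer mentions.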