Spark dataframe: Pivot and Group based on columns


Question

I have an input dataframe as below with id, app, and customer columns.

Input dataframe

+--------------------+-----+---------+
|                  id|app  |customer |
+--------------------+-----+---------+
|id1                 |   fw|     WM  |
|id1                 |   fw|     CS  |
|id2                 |   fw|     CS  |
|id1                 |   fe|     WM  |
|id3                 |   bc|     TR  |
|id3                 |   bc|     WM  |
+--------------------+-----+---------+

Expected output

Using pivot and aggregate - make the app values column names and put the aggregated customer names as a list in the dataframe.

Expected dataframe

+--------------------+----------+-------+----------+
|                  id|   bc     |     fe|    fw    |
+--------------------+----------+-------+----------+
|id1                 |  0       |     WM|   [WM,CS]|
|id2                 |  0       |      0|   [CS]   |
|id3                 | [TR,WM]  |      0|      0   |
+--------------------+----------+-------+----------+

What have I tried?

val newDF = df.groupBy("id").pivot("app").agg(expr("coalesce(first(customer),0)")).drop("app").show()

+--------------------+-----+-------+------+
|                  id|bc   |     fe|    fw|
+--------------------+-----+-------+------+
|id1                 |  0  |     WM|    WM|
|id2                 |  0  |      0|    CS|
|id3                 | TR  |      0|     0|
+--------------------+-----+-------+------+

Issue: In my query, I am not able to get a list of customers like [WM, CS] for "id1" under "fw" (as shown in the expected output); only "WM" comes through. Similarly, for "id3" only "TR" appears, whereas a list with the value [TR, WM] should appear under "bc" for "id3".

Need your suggestion on how to get the list of customers under each app respectively.

Answer

You can use collect_list if you can live with an empty list in the cells where the expected output shows zero. Unlike first(customer), which keeps only one value per (id, app) group, collect_list gathers all customer values in the group into an array:

df.groupBy("id").pivot("app").agg(collect_list("customer")).show
+---+--------+----+--------+
| id|      bc|  fe|      fw|
+---+--------+----+--------+
|id3|[TR, WM]|  []|      []|
|id1|      []|[WM]|[CS, WM]|
|id2|      []|  []|    [CS]|
+---+--------+----+--------+
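For completeness, a self-contained sketch of the whole pipeline, assuming a local SparkSession (the app name "pivot-demo" is illustrative). The last step is optional: it converts the array columns to comma-separated strings and substitutes "0" for empty lists, to match the "0" cells in the expected output:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Hypothetical local session so the example is runnable on its own
val spark = SparkSession.builder.appName("pivot-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Recreate the input dataframe from the question
val df = Seq(
  ("id1", "fw", "WM"), ("id1", "fw", "CS"), ("id2", "fw", "CS"),
  ("id1", "fe", "WM"), ("id3", "bc", "TR"), ("id3", "bc", "WM")
).toDF("id", "app", "customer")

// Pivot on app, collecting all customers per (id, app) into a list
val pivoted = df.groupBy("id").pivot("app").agg(collect_list("customer"))

// Optional: replace empty lists with "0", rendering non-empty lists
// as comma-separated strings so the column type stays consistent
val appCols = pivoted.columns.filter(_ != "id")
val result = appCols.foldLeft(pivoted) { (acc, c) =>
  acc.withColumn(c, when(size(col(c)) === 0, lit("0")).otherwise(concat_ws(",", col(c))))
}
result.show()
```

Note that when/otherwise must produce a single column type, which is why the arrays are flattened to strings before substituting "0"; if you prefer to keep real array columns, leave the empty lists as-is.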
