Apache Spark group by combining types and sub types

Question

I have this dataset in Spark:

// toDF needs spark.implicits._ (already in scope in spark-shell)
val sales = Seq(
  ("Warsaw", 2016, "facebook", "share", 100),
  ("Warsaw", 2017, "facebook", "like", 200),
  ("Boston", 2015, "twitter", "share", 50),
  ("Boston", 2016, "facebook", "share", 150),
  ("Toronto", 2017, "twitter", "like", 50)
).toDF("city", "year", "media", "action", "amount")

I can group this by city and media like this:

// Count rows per (city, media) pair
val groupByCityAndMedia = sales
  .groupBy("city", "media")
  .count()
groupByCityAndMedia.show()

+-------+--------+-----+
|   city|   media|count|
+-------+--------+-----+
| Boston|facebook|    1|
| Boston| twitter|    1|
|Toronto| twitter|    1|
| Warsaw|facebook|    2|
+-------+--------+-----+

But how can I combine media and action into one column, so that the expected output is:

+-------+--------+-----+
| Boston|facebook|    1|
| Boston|   share|    2|
| Boston| twitter|    1|
|Toronto| twitter|    1|
|Toronto|    like|    1|
| Warsaw|facebook|    2|
| Warsaw|   share|    1|
| Warsaw|    like|    1|
+-------+--------+-----+

Recommended answer

Combine the media and action columns into an array column, explode it, then do a groupBy count:

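import org.apache.spark.sql.functions.{array, explode}

// Pack media and action into one array per row, then explode it into one row per element
sales.select(
    $"city", explode(array($"media", $"action")).as("mediaAction")
).groupBy("city", "mediaAction").count().show()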

+-------+-----------+-----+
|   city|mediaAction|count|
+-------+-----------+-----+
| Boston|      share|    2|
| Boston|   facebook|    1|
| Warsaw|      share|    1|
| Boston|    twitter|    1|
| Warsaw|       like|    1|
|Toronto|    twitter|    1|
|Toronto|       like|    1|
| Warsaw|   facebook|    2|
+-------+-----------+-----+
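
To make the explode step concrete, here is a minimal sketch of the same logic on plain Scala collections, with no Spark involved. The `Sale` case class and the names `rows`, `exploded` and `counts` are made up for illustration; `flatMap` stands in for `explode(array(...))` and `groupBy` plus `size` stands in for `groupBy(...).count()`.

```scala
// Illustration only: plain Scala collections, not the Spark API.
// Sale is a hypothetical case class mirroring the columns of the sales DataFrame.
case class Sale(city: String, year: Int, media: String, action: String, amount: Int)

val rows = Seq(
  Sale("Warsaw", 2016, "facebook", "share", 100),
  Sale("Warsaw", 2017, "facebook", "like", 200),
  Sale("Boston", 2015, "twitter", "share", 50),
  Sale("Boston", 2016, "facebook", "share", 150),
  Sale("Toronto", 2017, "twitter", "like", 50)
)

// flatMap plays the role of explode: each input row yields two (city, value) pairs.
val exploded: Seq[(String, String)] =
  rows.flatMap(s => Seq((s.city, s.media), (s.city, s.action)))

// Grouping identical pairs and taking the group size plays the role of groupBy(...).count().
val counts: Map[(String, String), Int] =
  exploded.groupBy(identity).map { case (key, group) => key -> group.size }

counts.toSeq.sorted.foreach { case ((city, value), n) => println(s"$city $value $n") }
```

Each of the five input rows turns into two rows before counting, which is why the result has one row per distinct (city, media-or-action) pair.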

Or, assuming media and action don't intersect (the two columns have no common elements, since otherwise a value present in both would show up as two separate rows for the same city):

sales.groupBy("city", "media").count().union(
    sales.groupBy("city", "action").count()
).show()

+-------+--------+-----+
|   city|   media|count|
+-------+--------+-----+
| Boston|facebook|    1|
| Boston| twitter|    1|
|Toronto| twitter|    1|
| Warsaw|facebook|    2|
| Boston|   share|    2|
| Warsaw|   share|    1|
| Warsaw|    like|    1|
|Toronto|    like|    1|
+-------+--------+-----+
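
If media and action might share values, one possible variant is to union the two columns first and count afterwards, so that a shared value ends up in a single row per city. This is a sketch under assumptions rather than part of the original answer: the local `SparkSession` setup and the names `stacked` and `combined` are illustrative, and the column alias `mediaAction` is borrowed from the explode version.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation; in spark-shell, getOrCreate()
// simply returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("union-then-count")
  .getOrCreate()
import spark.implicits._

val sales = Seq(
  ("Warsaw", 2016, "facebook", "share", 100),
  ("Warsaw", 2017, "facebook", "like", 200),
  ("Boston", 2015, "twitter", "share", 50),
  ("Boston", 2016, "facebook", "share", 150),
  ("Toronto", 2017, "twitter", "like", 50)
).toDF("city", "year", "media", "action", "amount")

// Stack (city, media) on top of (city, action). union matches columns by position,
// so the result keeps the column names of the first DataFrame.
val stacked = sales.select($"city", $"media".as("mediaAction"))
  .union(sales.select($"city", $"action"))

// Count after the union, so identical (city, value) pairs from either column are merged.
val combined = stacked.groupBy("city", "mediaAction").count()
combined.show()
```

On the sample data this gives the same eight rows as the explode version, though the row order printed by show() is not guaranteed.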
