How to GROUPING SETS as operator/method on Dataset?


Question

Is there no function-level grouping_sets support in Spark Scala?

I have no idea whether this patch was applied to master: https://github.com/apache/spark/pull/5080

I want to do this kind of query through the Scala DataFrame API:

GROUP BY expression list GROUPING SETS(expression list2)

The cube and rollup functions are available in the Dataset API, but I can't find grouping sets. Why?

Answer

I want to do this kind of query by the Scala DataFrame API.

tl;dr Up to Spark 2.1.0 it is not possible. There are currently no plans to add such an operator to the Dataset API.

Spark SQL supports the following so-called multi-dimensional aggregate operators:

  • rollup operator
  • cube operator
  • GROUPING SETS clause (only in SQL mode)
  • grouping() and grouping_id() functions

NOTE: GROUPING SETS is only available in SQL mode. There is no support in the Dataset API.

val sales = Seq(
  ("Warsaw", 2016, 100),
  ("Warsaw", 2017, 200),
  ("Boston", 2015, 50),
  ("Boston", 2016, 150),
  ("Toronto", 2017, 50)
).toDF("city", "year", "amount")
sales.createOrReplaceTempView("sales")

// equivalent to rollup("city", "year")
val q = sql("""
  SELECT city, year, sum(amount) as amount
  FROM sales
  GROUP BY city, year
  GROUPING SETS ((city, year), (city), ())
  ORDER BY city DESC NULLS LAST, year ASC NULLS LAST
  """)
scala> q.show
+-------+----+------+
|   city|year|amount|
+-------+----+------+
| Warsaw|2016|   100|
| Warsaw|2017|   200|
| Warsaw|null|   300|
|Toronto|2017|    50|
|Toronto|null|    50|
| Boston|2015|    50|
| Boston|2016|   150|
| Boston|null|   200|
|   null|null|   550|  <-- grand total across all cities and years
+-------+----+------+
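To see what the GROUPING SETS clause in the query above actually expands to, here is a sketch using plain Scala collections (not the Spark API — the names byCityYear, byCity, and grandTotal are illustrative): each grouping set becomes one groupBy-and-sum, the partial results are unioned, and None plays the role of SQL NULL in the rolled-up columns.

```scala
// Simulating GROUPING SETS ((city, year), (city), ()) with plain Scala
// collections: one groupBy-and-sum per grouping set, results unioned.
val sales = Seq(
  ("Warsaw", 2016, 100),
  ("Warsaw", 2017, 200),
  ("Boston", 2015, 50),
  ("Boston", 2016, 150),
  ("Toronto", 2017, 50)
)

// (city, year) grouping set
val byCityYear = sales
  .groupBy { case (city, year, _) => (Option(city), Option(year)) }
  .toSeq
  .map { case ((city, year), rows) => (city, year, rows.map(_._3).sum) }

// (city) grouping set: year is rolled up to None (NULL in SQL)
val byCity = sales
  .groupBy { case (city, _, _) => Option(city) }
  .toSeq
  .map { case (city, rows) => (city, None: Option[Int], rows.map(_._3).sum) }

// () grouping set: the grand total across all cities and years
val grandTotal =
  Seq((None: Option[String], None: Option[Int], sales.map(_._3).sum))

val groupingSets = byCityYear ++ byCity ++ grandTotal
```

The union has 5 + 3 + 1 = 9 rows, matching the table above row for row, with None where the output shows null.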

// equivalent to cube("city", "year")
// note the additional (year) grouping set
val q = sql("""
  SELECT city, year, sum(amount) as amount
  FROM sales
  GROUP BY city, year
  GROUPING SETS ((city, year), (city), (year), ())
  ORDER BY city DESC NULLS LAST, year ASC NULLS LAST
  """)
scala> q.show
+-------+----+------+
|   city|year|amount|
+-------+----+------+
| Warsaw|2016|   100|
| Warsaw|2017|   200|
| Warsaw|null|   300|
|Toronto|2017|    50|
|Toronto|null|    50|
| Boston|2015|    50|
| Boston|2016|   150|
| Boston|null|   200|
|   null|2015|    50|  <-- total across all cities in 2015
|   null|2016|   250|  <-- total across all cities in 2016
|   null|2017|   250|  <-- total across all cities in 2017
|   null|null|   550|
+-------+----+------+
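The three extra rows that the (year) grouping set contributes (totals across all cities per year) can be checked with the same plain-Scala approach (a sketch; byYear is an illustrative name, not a Spark API):

```scala
// The (year) grouping set that cube adds over rollup: city is rolled up,
// leaving one total per year across all cities.
val sales = Seq(
  ("Warsaw", 2016, 100),
  ("Warsaw", 2017, 200),
  ("Boston", 2015, 50),
  ("Boston", 2016, 150),
  ("Toronto", 2017, 50)
)

val byYear = sales
  .groupBy { case (_, year, _) => year }
  .map { case (year, rows) => year -> rows.map(_._3).sum }
```

This reproduces the three `null` city rows in the cube output: 50 for 2015, 250 for 2016, and 250 for 2017.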

