多维数据集、汇总和 groupBy 运算符之间有什么区别? [英] What is the difference between cube, rollup and groupBy operators?

查看：24 发布时间：2021/11/14 21:28:24 sql apache-spark apache-spark-sql cube rollup

本文介绍了多维数据集、汇总和 groupBy 运算符之间有什么区别?的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

问题几乎在标题中.我找不到有关这些差异的任何详细文档.

Question is pretty much in the title. I can't find any detailed documentation regarding the differences.

我确实注意到了一个不同之处，因为在交换多维数据集和 groupBy 函数调用时，我得到了不同的结果.我注意到对于使用 'cube' 的结果，我在经常分组的表达式上得到了很多空值.

I do notice a difference because when interchanging cube and groupBy function calls, I get different results. I noticed that for the result using 'cube', I got a lot of null values on the expressions I often grouped by.

推荐答案

这些并非旨在以相同的方式工作.groupBy 相当于标准 SQL 中的 GROUP BY 子句.换句话说

These are not intended to work in the same way. groupBy is simply an equivalent of the GROUP BY clause in standard SQL. In other words

table.groupBy($"foo", $"bar")

相当于:

SELECT foo, bar, [agg-expressions] FROM table GROUP BY foo, bar

cube 相当于 CUBE 对 GROUP BY 的扩展.它需要一个列列表并将聚合表达式应用于分组列的所有可能组合.假设您有这样的数据:

cube is equivalent to CUBE extension to GROUP BY. It takes a list of columns and applies aggregate expressions to all possible combinations of the grouping columns. Lets say you have data like this:

val df = Seq(("foo", 1L), ("foo", 2L), ("bar", 2L), ("bar", 2L)).toDF("x", "y")

df.show

// +---+---+
// |  x|  y|
// +---+---+
// |foo|  1|
// |foo|  2|
// |bar|  2|
// |bar|  2|
// +---+---+

然后你计算 cube(x, y) 并将 count 作为聚合:

and you compute cube(x, y) with count as an aggregation:

df.cube($"x", $"y").count.show

// +----+----+-----+     
// |   x|   y|count|
// +----+----+-----+
// |null|   1|    1|   <- count of records where y = 1
// |null|   2|    3|   <- count of records where y = 2
// | foo|null|    2|   <- count of records where x = foo
// | bar|   2|    2|   <- count of records where x = bar AND y = 2
// | foo|   1|    1|   <- count of records where x = foo AND y = 1
// | foo|   2|    1|   <- count of records where x = foo AND y = 2
// |null|null|    4|   <- total count of records
// | bar|null|    2|   <- count of records where x = bar
// +----+----+-----+

与 cube 类似的函数是 rollup，它从左到右计算分层小计:

A similar function to cube is rollup which computes hierarchical subtotals from left to right:

df.rollup($"x", $"y").count.show
// +----+----+-----+
// |   x|   y|count|
// +----+----+-----+
// | foo|null|    2|   <- count where x is fixed to foo
// | bar|   2|    2|   <- count where x is fixed to bar and y is fixed to  2
// | foo|   1|    1|   ...
// | foo|   2|    1|   ...
// |null|null|    4|   <- count where no column is fixed
// | bar|null|    2|   <- count where x is fixed to bar
// +----+----+-----+

为了比较，让我们看看普通groupBy的结果:

Just for comparison lets see the result of plain groupBy:

df.groupBy($"x", $"y").count.show

// +---+---+-----+
// |  x|  y|count|
// +---+---+-----+
// |foo|  1|    1|   <- this is identical to x = foo AND y = 1 in CUBE or ROLLUP
// |foo|  2|    1|   <- this is identical to x = foo AND y = 2 in CUBE or ROLLUP
// |bar|  2|    2|   <- this is identical to x = bar AND y = 2 in CUBE or ROLLUP
// +---+---+-----+

总结:

当使用普通的 GROUP BY 时，每一行在其相应的摘要中只包含一次.
使用 GROUP BY CUBE(..) 时，每一行都包含在它所代表的每个级别组合的摘要中，包括通配符.从逻辑上讲，上面显示的内容等效于这样的内容(假设我们可以使用 NULL 占位符):

When using plain GROUP BY every row is included only once in its corresponding summary.
With GROUP BY CUBE(..) every row is included in summary of each combination of levels it represents, wildcards included. Logically, the shown above is equivalent to something like this (assuming we could use NULL placeholders):

SELECT NULL, NULL, COUNT(*) FROM table
UNION ALL
SELECT x,    NULL, COUNT(*) FROM table GROUP BY x
UNION ALL
SELECT NULL, y,    COUNT(*) FROM table GROUP BY y
UNION ALL
SELECT x,    y,    COUNT(*) FROM table GROUP BY x, y

With GROUP BY ROLLUP(...) 与 CUBE 类似，但通过从左到右填充列来分层工作.

With GROUP BY ROLLUP(...) is similar to CUBE but works hierarchically by filling colums from left to right.

SELECT NULL, NULL, COUNT(*) FROM table
UNION ALL
SELECT x,    NULL, COUNT(*) FROM table GROUP BY x
UNION ALL
SELECT x,    y,    COUNT(*) FROM table GROUP BY x, y

ROLLUP 和 CUBE 来自数据仓库扩展，所以如果你想更好地理解它是如何工作的，你还可以查看你最喜欢的 RDMBS 的文档.例如，PostgreSQL 在 9.5 和这些都有相对完善的文档记录.

ROLLUP and CUBE come from data warehousing extensions so if you want to get a better understanding how this works you can also check documentation of your favorite RDMBS. For example PostgreSQL introduced both in 9.5 and these are relatively well documented.

这篇关于多维数据集、汇总和 groupBy 运算符之间有什么区别?的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

多维数据集、汇总和 groupBy 运算符之间有什么区别? [英] What is the difference between cube, rollup and groupBy operators?

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

多维数据集、汇总和 groupBy 运算符之间有什么区别? [英] What is the difference between cube, rollup and groupBy operators?

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭