多维数据集,汇总和groupBy运算符有什么区别? [英] What is the difference between cube, rollup and groupBy operators?

查看:103
本文介绍了多维数据集,汇总和groupBy运算符有什么区别?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

标题中几乎有问题.我找不到有关差异的详细文档.

Question is pretty much in the title. I can't find any detailed documentation regarding the differences.

我确实注意到了一个差异,因为在交换多维数据集和groupBy函数调用时,我得到了不同的结果.我注意到,对于使用多维数据集"的结果,我经常将其分组的表达式上有很多空值.

I do notice a difference because when interchanging cube and groupBy function calls, I get different results. I noticed that for the result using 'cube', I got a lot of null values on the expressions I often grouped by.

推荐答案

这些方法不能以相同的方式工作. groupBy只是标准SQL中GROUP BY子句的等效项.换句话说

These are not intended to work in the same way. groupBy is simply an equivalent of the GROUP BY clause in standard SQL. In other words

table.groupBy($"foo", $"bar")

等效于:

SELECT foo, bar, [agg-expressions] FROM table GROUP BY foo, bar

cube等效于GROUP BYCUBE扩展名.它获取列列表,并将聚合表达式应用于分组列的所有可能的组合.假设您有这样的数据:

cube is equivalent to CUBE extension to GROUP BY. It takes a list of columns and applies aggregate expressions to all possible combinations of the grouping columns. Lets say you have data like this:

val df = Seq(("foo", 1L), ("foo", 2L), ("bar", 2L), ("bar", 2L)).toDF("x", "y")

df.show

// +---+---+
// |  x|  y|
// +---+---+
// |foo|  1|
// |foo|  2|
// |bar|  2|
// |bar|  2|
// +---+---+

,并且您将cube(x, y)计算为总数:

and you compute cube(x, y) with count as an aggregation:

df.cube($"x", $"y").count.show

// +----+----+-----+     
// |   x|   y|count|
// +----+----+-----+
// |null|   1|    1|   <- count of records where y = 1
// |null|   2|    3|   <- count of records where y = 2
// | foo|null|    2|   <- count of records where x = foo
// | bar|   2|    2|   <- count of records where x = bar AND y = 2
// | foo|   1|    1|   <- count of records where x = foo AND y = 1
// | foo|   2|    1|   <- count of records where x = foo AND y = 2
// |null|null|    4|   <- total count of records
// | bar|null|    2|   <- count of records where x = bar
// +----+----+-----+

cube相似的函数是rollup,它从左到右计算层次结构小计:

A similar function to cube is rollup which computes hierarchical subtotals from left to right:

df.rollup($"x", $"y").count.show
// +----+----+-----+
// |   x|   y|count|
// +----+----+-----+
// | foo|null|    2|   <- count where x is fixed to foo
// | bar|   2|    2|   <- count where x is fixed to bar and y is fixed to  2
// | foo|   1|    1|   ...
// | foo|   2|    1|   ...
// |null|null|    4|   <- count where no column is fixed
// | bar|null|    2|   <- count where x is fixed to bar
// +----+----+-----+

为了比较,让我们看一下普通groupBy的结果:

Just for comparison lets see the result of plain groupBy:

df.groupBy($"x", $"y").count.show

// +---+---+-----+
// |  x|  y|count|
// +---+---+-----+
// |foo|  1|    1|   <- this is identical to x = foo AND y = 1 in CUBE or ROLLUP
// |foo|  2|    1|   <- this is identical to x = foo AND y = 2 in CUBE or ROLLUP
// |bar|  2|    2|   <- this is identical to x = bar AND y = 2 in CUBE or ROLLUP
// +---+---+-----+

总结:

  • 使用普通GROUP BY时,每行在其相应的摘要中仅包含一次.
  • 对于GROUP BY CUBE(..),每一行都包括在它表示的级别的每个组合的摘要中,包括通配符.从逻辑上讲,上面显示的内容等效于以下内容(假设我们可以使用NULL占位符):

  • When using plain GROUP BY every row is included only once in its corresponding summary.
  • With GROUP BY CUBE(..) every row is included in summary of each combination of levels it represents, wildcards included. Logically, the shown above is equivalent to something like this (assuming we could use NULL placeholders):

SELECT NULL, NULL, COUNT(*) FROM table
UNION ALL
SELECT x,    NULL, COUNT(*) FROM table GROUP BY x
UNION ALL
SELECT NULL, y,    COUNT(*) FROM table GROUP BY y
UNION ALL
SELECT x,    y,    COUNT(*) FROM table GROUP BY x, y

  • 使用GROUP BY ROLLUP(...)CUBE相似,但是通过从左到右填充列来进行分层工作.

  • With GROUP BY ROLLUP(...) is similar to CUBE but works hierarchically by filling colums from left to right.

    SELECT NULL, NULL, COUNT(*) FROM table
    UNION ALL
    SELECT x,    NULL, COUNT(*) FROM table GROUP BY x
    UNION ALL
    SELECT x,    y,    COUNT(*) FROM table GROUP BY x, y
    

  • ROLLUPCUBE来自数据仓库扩展,因此,如果您想更好地了解其工作原理,还可以查看自己喜欢的RDMBS的文档.例如PostgreSQL在9.5和中都有介绍,这些文献相对完整.

    ROLLUP and CUBE come from data warehousing extensions so if you want to get a better understanding how this works you can also check documentation of your favorite RDMBS. For example PostgreSQL introduced both in 9.5 and these are relatively well documented.

    这篇关于多维数据集,汇总和groupBy运算符有什么区别?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

    查看全文
    登录 关闭
    扫码关注1秒登录
    发送“验证码”获取 | 15天全站免登陆