Windowing function in Hive

Question

I am exploring windowing functions in Hive and I understand what each of the functions does. However, I do not understand the PARTITION BY and ORDER BY clauses that are used with them. The following query is very similar in structure to the one I am planning to build:

SELECT a, RANK() OVER(partition by b order by c) as d from xyz; 

I am just trying to understand what happens behind the scenes for both keywords.

Thanks for the help :)

Recommended answer

The RANK() analytic function assigns a rank to each row within each partition of the dataset.

The PARTITION BY clause determines how the rows are distributed (between reducers, in the case of Hive).

The ORDER BY clause determines how the rows are sorted within each partition.

The first phase is the distribution: all rows in the dataset are distributed into partitions. In map-reduce, each mapper groups rows according to the PARTITION BY clause and produces files for each partition. The mapper also does an initial sort of its partition parts according to the ORDER BY clause.
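The distribution phase can be pictured with a small Python sketch. This is only an illustration, not Hive's actual code: the sample rows, the `distribute` helper, and the choice of hash-modulo routing are all assumptions made for the example.

```python
from collections import defaultdict

# Rows of a hypothetical table xyz as (a, b, c) tuples, in arbitrary arrival order.
rows = [(3, 1, 2), (5, 2, 4), (1, 1, 1), (6, 2, 5), (2, 1, 1), (4, 2, 3)]

def distribute(rows, num_reducers):
    """Route each row to a reducer by its PARTITION BY key b.

    Hash-modulo routing is an assumption for this sketch.
    """
    buckets = defaultdict(list)
    for row in rows:
        b = row[1]
        # hash() of a small int is the int itself, so this is deterministic here.
        buckets[hash(b) % num_reducers].append(row)
    # Pre-sort each bucket by (b, c), like the mapper-side initial sort.
    return {r: sorted(part, key=lambda row: (row[1], row[2]))
            for r, part in buckets.items()}

buckets = distribute(rows, num_reducers=2)
for reducer, part in sorted(buckets.items()):
    print(reducer, part)
```

The point of the sketch is that all rows sharing the same value of b end up in the same bucket, so a single reducer sees the whole partition.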

In the second phase, all rows are sorted inside each partition. In map-reduce, each reducer fetches the partition files (parts of partitions) produced by the mappers and sorts the rows of the whole partition according to the ORDER BY clause, merging the partially sorted results.
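As a rough sketch of this reducer-side step, the snippet below merges two already sorted parts of one partition into a single sorted stream. The part contents and the `merge_parts` helper are hypothetical, and the real merge in map-reduce works on files rather than in-memory lists.

```python
import heapq

# Two pre-sorted parts of partition b=1, as two mappers might emit them (hypothetical data).
part_from_mapper_1 = [(1, 1, 1), (3, 1, 2)]
part_from_mapper_2 = [(2, 1, 1)]

def merge_parts(*parts):
    """Merge parts that are each sorted by (b, c) into one sorted list."""
    return list(heapq.merge(*parts, key=lambda row: (row[1], row[2])))

partition = merge_parts(part_from_mapper_1, part_from_mapper_2)
print([row[2] for row in partition])  # values of c in merged order
```

Because each part is already sorted, the reducer only needs a merge rather than a full sort from scratch.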

In the third phase, the rank function assigns a rank to each row in a partition. The rank function is initialized separately for each partition.

For the first row in a partition, the rank starts at 1. Each subsequent row with a greater ORDER BY value gets a rank equal to its position in the partition, which is the previous row's rank + 1 when there are no ties. Rows with equal values (of the columns specified in the ORDER BY) are given the same rank, and if two rows share the same rank, the rank of the next row is not consecutive.
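The ranking rule can be written down as a short Python sketch. The `rank` function below is a hypothetical helper that mimics the RANK() semantics described above for one partition that is already sorted; it is not how Hive implements it.

```python
def rank(sorted_partition, order_key):
    """Return RANK()-style ranks for rows already sorted by order_key."""
    ranks = []
    prev_value = object()  # sentinel that compares unequal to any real value
    current_rank = 0
    for position, row in enumerate(sorted_partition, start=1):
        value = order_key(row)
        if value != prev_value:
            # A new ORDER BY value: the rank jumps to the row's position.
            current_rank = position
            prev_value = value
        # Tied rows keep the rank of the first row with that value.
        ranks.append(current_rank)
    return ranks

partition_b1 = [(1, 1, 1), (2, 1, 1), (3, 1, 2)]  # (a, b, c), sorted by c
partition_b2 = [(4, 2, 3), (5, 2, 4), (6, 2, 5)]

ranks_b1 = rank(partition_b1, order_key=lambda row: row[2])
ranks_b2 = rank(partition_b2, order_key=lambda row: row[2])
print(ranks_b1, ranks_b2)  # [1, 1, 3] [1, 2, 3]
```

Calling the function once per partition corresponds to the rank function being re-initialized at each partition boundary.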

Different partitions can be processed in parallel on different reducers. Several small partitions can be processed on the same reducer. The rank function is re-initialized when it crosses a partition boundary and starts again with rank=1 for each partition.

Example (rows are already partitioned and sorted inside partitions):

SELECT a, RANK() OVER(partition by b order by c) as d from xyz; 

a, b, c, d(rank)
----------------
1  1  1  1 --starts with 1
2  1  1  1 --the same c value, the same rank=1
3  1  2  3 --rank 2 is skipped because second row shares the same rank as first 

4  2  3  1 --New partition starts with 1
5  2  4  2
6  2  5  3

If you need consecutive ranks, use the dense_rank function. dense_rank will produce rank=2 for the third row in the above dataset.
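For comparison, here is a minimal Python sketch of the dense_rank behaviour on the first partition of the example. The `dense_rank` helper is hypothetical and only mirrors the semantics described here.

```python
def dense_rank(sorted_partition, order_key):
    """Return DENSE_RANK()-style ranks for rows already sorted by order_key."""
    ranks = []
    prev_value = object()  # sentinel that compares unequal to any real value
    current_rank = 0
    for row in sorted_partition:
        value = order_key(row)
        if value != prev_value:
            # A new ORDER BY value advances the rank by exactly one, with no gaps.
            current_rank += 1
            prev_value = value
        ranks.append(current_rank)
    return ranks

partition_b1 = [(1, 1, 1), (2, 1, 1), (3, 1, 2)]  # (a, b, c), sorted by c
dense_ranks = dense_rank(partition_b1, order_key=lambda row: row[2])
print(dense_ranks)  # [1, 1, 2]
```

The only difference from a gap-leaving rank is that the counter advances by one per distinct value instead of jumping to the row position.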

The row_number function assigns a position number to each row in the partition, starting with 1. Rows with equal values receive different consecutive numbers.

SELECT a, ROW_NUMBER() OVER(partition by b order by c) as d from xyz; 

a, b, c, d(row_number)
----------------
1  1  1  1 --starts with 1
2  1  1  2 --the same c value, row number=2
3  1  2  3 --row position=3

4  2  3  1 --New partition starts with 1
5  2  4  2
6  2  5  3

Important note: for rows with equal ORDER BY values, row_number and other such analytic functions may behave non-deterministically and produce different numbers from run to run. The first row in the above dataset may receive number 2 and the second row number 1, or vice versa, because their relative order is not determined unless you add one more column, a, to the ORDER BY clause (order by c, a). In that case every row will get the same row_number from run to run, because the ORDER BY values of all rows are then distinct.
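The effect of the tie-breaker can be reproduced in a small Python sketch. The two "runs" below are hypothetical arrival orders of the same partition, and `row_number` is an illustrative helper; Python's stable sort stands in for whatever order the rows happen to reach the reducer in.

```python
def row_number(rows, order_key):
    """Map each row's a value to its 1-based position after sorting by order_key."""
    ordered = sorted(rows, key=order_key)  # stable: ties keep arrival order
    return {row[0]: n for n, row in enumerate(ordered, start=1)}

# The same partition b=1 arriving in two different orders (hypothetical runs).
run_1 = [(1, 1, 1), (2, 1, 1), (3, 1, 2)]
run_2 = [(2, 1, 1), (1, 1, 1), (3, 1, 2)]

# ORDER BY c only: the tied rows a=1 and a=2 swap numbers between runs.
by_c_1 = row_number(run_1, lambda row: row[2])
by_c_2 = row_number(run_2, lambda row: row[2])

# ORDER BY c, a: the extra column breaks the tie, so both runs agree.
by_c_a_1 = row_number(run_1, lambda row: (row[2], row[0]))
by_c_a_2 = row_number(run_2, lambda row: (row[2], row[0]))

print(by_c_1 == by_c_2)      # False
print(by_c_a_1 == by_c_a_2)  # True
```

Adding a column that makes every ORDER BY key unique within the partition is what removes the dependence on arrival order.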
