Windowing functions in Hive

Question

I am exploring windowing functions in Hive and I understand what each of the functions does. However, I do not understand the partition by and order by clauses that are used together with them. The query below is very similar in structure to the one I am planning to build.

SELECT a, RANK() OVER(partition by b order by c) as d from xyz; 

I am just trying to understand what happens behind the scenes for these two keywords.

Thanks for the help :)

Recommended answer

The RANK() analytic function assigns a rank to each row within each partition of the dataset.

The PARTITION BY clause determines how the rows are distributed (between reducers, in the case of Hive).

The ORDER BY clause determines how the rows are sorted within each partition.

The first phase is distribution (distribute by): all rows in the dataset are distributed into partitions. In map-reduce, each mapper groups rows according to the partition by columns and produces files for each partition. The mapper also does an initial sort of its parts of each partition according to the order by.

In the second phase, all rows are sorted inside each partition. In map-reduce, each reducer gets the partition files (parts of partitions) produced by the mappers and sorts the rows of the whole partition according to the order by, combining the partially sorted results.
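The two phases above can be sketched in plain Python. This is only an illustration of the idea, not Hive's actual implementation: the helper names, the number of reducers, and the use of a simple hash of b to pick a reducer are all assumptions made for the example.

```python
from collections import defaultdict

# Rows of the table xyz as (a, b, c) tuples, in arbitrary arrival order.
rows = [(6, 2, 5), (1, 1, 1), (4, 2, 3), (3, 1, 2), (5, 2, 4), (2, 1, 1)]

NUM_REDUCERS = 2  # assumed value, chosen only for illustration


def distribute(rows, num_reducers):
    """Phase 1: send each row to a reducer chosen from its partition key b.

    All rows with the same b end up on the same reducer.
    """
    reducers = defaultdict(list)
    for a, b, c in rows:
        reducers[hash(b) % num_reducers].append((a, b, c))
    return reducers


def sort_within_reducer(reducer_rows):
    """Phase 2: sort by (b, c) so each partition is contiguous and ordered by c."""
    return sorted(reducer_rows, key=lambda r: (r[1], r[2]))


reducers = distribute(rows, NUM_REDUCERS)
sorted_reducers = {rid: sort_within_reducer(rs) for rid, rs in reducers.items()}

for rid in sorted(sorted_reducers):
    print(rid, sorted_reducers[rid])
```

A reducer may receive more than one partition, which is why the sort key includes b as well as c: it keeps the rows of each partition together.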

In the third phase, the rank function assigns a rank to each row in a partition. The rank function is initialized separately for each partition.

The first row in a partition gets rank 1. Rows with equal values in the order by columns get the same rank. A row whose order by value differs from the previous row's gets a rank equal to its position in the partition, so when two or more rows share a rank, the next rank is not consecutive.

Different partitions can be processed in parallel on different reducers. Several small partitions can be processed on the same reducer. The rank function re-initializes when it crosses a partition boundary and starts with rank=1 for each partition.

Example (rows are already partitioned and sorted inside partitions):

SELECT a, RANK() OVER(partition by b order by c) as d from xyz; 

a, b, c, d(rank)
----------------
1  1  1  1 --starts with 1
2  1  1  1 --the same c value, the same rank=1
3  1  2  3 --rank 2 is skipped because second row shares the same rank as first 

4  2  3  1 --New partition starts with 1
5  2  4  2
6  2  5  3
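The ranking step that produces the table above can be sketched in Python as follows. This is a minimal model of the RANK() semantics, assuming the rows are already sorted by (b, c); the function name rank_rows is hypothetical and has nothing to do with Hive's internals.

```python
from itertools import groupby

# Rows as (a, b, c), already partitioned by b and sorted by c within each partition.
sorted_rows = [(1, 1, 1), (2, 1, 1), (3, 1, 2), (4, 2, 3), (5, 2, 4), (6, 2, 5)]


def rank_rows(sorted_rows):
    """Append RANK() OVER (PARTITION BY b ORDER BY c) to each row."""
    result = []
    # groupby yields one group per run of equal b, i.e. one group per partition.
    for _, partition in groupby(sorted_rows, key=lambda r: r[1]):
        rank = 0
        prev_c = None
        for position, (a, b, c) in enumerate(partition, start=1):
            # A new order by value takes the row's position as its rank;
            # a repeated value keeps the rank of the previous row.
            if position == 1 or c != prev_c:
                rank = position
            result.append((a, b, c, rank))
            prev_c = c
    return result


ranked = rank_rows(sorted_rows)
for row in ranked:
    print(row)
```

Because the rank is taken from the row's position rather than incremented by one, the gap after a tie appears without any extra bookkeeping.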

If you need consecutive ranks, use the dense_rank function. dense_rank will produce rank=2 for the third row in the above dataset.
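For comparison, here is a small Python sketch of the dense_rank semantics on the same data. It is an illustration under the same assumptions as before (rows pre-sorted by (b, c), hypothetical function name), not Hive code.

```python
from itertools import groupby

# Rows as (a, b, c), already partitioned by b and sorted by c within each partition.
sorted_rows = [(1, 1, 1), (2, 1, 1), (3, 1, 2), (4, 2, 3), (5, 2, 4), (6, 2, 5)]


def dense_rank_rows(sorted_rows):
    """Append DENSE_RANK() OVER (PARTITION BY b ORDER BY c) to each row."""
    result = []
    for _, partition in groupby(sorted_rows, key=lambda r: r[1]):
        rank = 0
        prev_c = None
        for position, (a, b, c) in enumerate(partition, start=1):
            # The rank grows by exactly one for each new order by value,
            # so ties never leave a gap.
            if position == 1 or c != prev_c:
                rank += 1
            result.append((a, b, c, rank))
            prev_c = c
    return result


dense_ranked = dense_rank_rows(sorted_rows)
for row in dense_ranked:
    print(row)
```

The only difference from a rank sketch is that the counter advances by one per distinct value instead of jumping to the row's position.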

The row_number function assigns a position number to each row in the partition, starting with 1. Rows with equal values receive different consecutive numbers.

SELECT a, ROW_NUMBER() OVER(partition by b order by c) as d from xyz; 

a, b, c, d(row_number)
----------------
1  1  1  1 --starts with 1
2  1  1  2 --the same c value, row number=2
3  1  2  3 --row position=3

4  2  3  1 --New partition starts with 1
5  2  4  2
6  2  5  3

Important note: for rows with the same order by values, row_number and other such analytic functions may behave non-deterministically and produce different numbers from run to run. The first row in the above dataset may receive number 2 and the second row number 1, or vice versa, because their relative order is undefined unless you add one more column, a, to the order by clause. In that case every row will get the same row_number on every run, because the order by values are then all distinct.
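The effect of the tie-breaker can be demonstrated with a small Python model of row_number. The two input lists stand in for two runs in which the same rows reach the reducer in a different order; this arrival-order model, and all names below, are assumptions made for the illustration.

```python
def row_numbers(rows, order_key):
    """Return {a: row_number} for ROW_NUMBER() OVER (PARTITION BY b ORDER BY ...).

    order_key maps a row (a, b, c) to a tuple of its order by values.
    Python's sort is stable, so tied rows keep their arrival order.
    """
    ordered = sorted(rows, key=lambda r: (r[1],) + order_key(r))
    numbers = {}
    current_b, n = None, 0
    for a, b, c in ordered:
        if b != current_b:  # partition boundary: restart the numbering
            current_b, n = b, 0
        n += 1
        numbers[a] = n
    return numbers


def by_c(row):
    """order by c"""
    return (row[2],)


def by_c_then_a(row):
    """order by c, a"""
    return (row[2], row[0])


# The same three rows of partition b=1, arriving in two different orders.
run1 = [(1, 1, 1), (2, 1, 1), (3, 1, 2)]
run2 = [(2, 1, 1), (1, 1, 1), (3, 1, 2)]

print(row_numbers(run1, by_c), row_numbers(run2, by_c))
print(row_numbers(run1, by_c_then_a), row_numbers(run2, by_c_then_a))
```

With order by c alone, the rows with a=1 and a=2 swap numbers between the two runs. With order by c, a the result is the same in both runs.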
