Get the latest data from a Hive table with multiple partition columns


Question

I have a Hive table with the following structure:

ID string,
Value string,
year int,
month int,
day int,
hour int,
minute int

The table is refreshed every 15 minutes and is partitioned by the year/month/day/hour/minute columns. Some sample partitions:

year=2019/month=12/day=29/hour=19/minute=15
year=2019/month=12/day=30/hour=00/minute=45
year=2019/month=12/day=30/hour=08/minute=45
year=2019/month=12/day=30/hour=09/minute=30
year=2019/month=12/day=30/hour=09/minute=45

I want to select only the data in the latest partition. I tried using max() on the partition columns, but that is not very efficient because the table is huge. How can I get this data conveniently using Hive SQL?

Recommended answer

If the latest partition is always in the current date, you can filter on the current date's partitions and use rank() to find the records with the latest hour and minute:

select * --list columns here
from
(
select s.*, rank() over(order by s.hour desc, s.minute desc) rnk
  from your_table s
 where s.year=year(current_date)   --filter on the current day (better to pass pre-calculated variables if possible)
   and s.month=lpad(month(current_date),2,'0')
   and s.day=lpad(day(current_date),2,'0')
   -- and s.hour=lpad(hour(current_timestamp),2,'0') --consider also adding this
) s
where rnk=1 --latest hour, minute

If the latest partition is not necessarily in current_date, you can instead use rank() over (order by s.year desc, s.month desc, s.day desc, hour desc, minute desc). Without a filter on the date, however, this scans the whole table and is not efficient.
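When the latest partition may fall on an earlier date, one possible way to avoid the full scan is to read the partition list from the metastore and build the filter from it. This is an alternative technique, not part of the answer above. The sketch below assumes partition specs arrive on standard input in the year=…/month=…/day=…/hour=…/minute=… form shown in the question, for example from Hive's show partitions statement. The function name latest_partition_filter is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads partition specs such as
#   year=2019/month=12/day=30/hour=09/minute=45
# from stdin and prints a WHERE-clause fragment for the latest one.
latest_partition_filter() {
  # Split "name=value/name=value/..." into blank-separated fields,
  # so the values land in fields 2, 4, 6, 8 and 10.
  tr '/=' '  ' |
    # Sort numerically by year, month, day, hour, minute (ascending).
    sort -k2,2n -k4,4n -k6,6n -k8,8n -k10,10n |
    # The last line is the latest partition.
    tail -n 1 |
    # %d drops leading zeros, matching the int partition columns.
    awk '{ printf "year=%d and month=%d and day=%d and hour=%d and minute=%d\n", $2, $4, $6, $8, $10 }'
}

# Possible usage (assumes the hive CLI is installed and the table is your_table):
#   filter=$(hive -S -e 'show partitions your_table' | latest_partition_filter)
#   hive -e "select * from your_table where $filter"
```

Because the filter compares partition columns with constants, Hive should be able to prune to a single partition instead of ranking every row.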

It will perform best if you can calculate the partition filters in the shell and pass them to the query as parameters. See the comments in the code.
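A minimal sketch of computing the date filter in the shell and passing it as parameters, assuming the hive CLI and its --hivevar option. The function name partition_vars, the variable names year/month/day, and the file name latest.hql are all hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: turns a YYYY-MM-DD date into --hivevar arguments.
# %d prints plain integers (no leading zeros), matching the int
# partition columns year, month and day.
partition_vars() {
  printf '%s\n' "$1" |
    awk -F- '{ printf "--hivevar year=%d --hivevar month=%d --hivevar day=%d\n", $1, $2, $3 }'
}

# Possible usage (assumes the hive CLI is installed):
#   hive $(partition_vars "$(date +%Y-%m-%d)") -f latest.hql
# where latest.hql (an assumed file) replaces the current_date expressions with
#   where s.year=${hivevar:year}
#     and s.month=${hivevar:month}
#     and s.day=${hivevar:day}
```

With the values substituted as constants before the query runs, the partition filter no longer depends on functions evaluated inside the query.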
