Optimizing Hive GROUP BY when rows are sorted


Question

I have the following (very simple) Hive query:

select user_id, event_id, min(time) as start, max(time) as end,
       count(*) as total, count(interaction == 1) as clicks
from events_all
group by user_id, event_id;

The table has the following structure:

user_id                 event_id                time            interaction 
Ex833Lli36nxTvGTA1Dv    juCUv6EnkVundBHSBzQevw  1430481530295   0
Ex833Lli36nxTvGTA1Dv    juCUv6EnkVundBHSBzQevw  1430481530295   1
n0w4uQhOuXymj5jLaCMQ    G+Oj6J9Q1nI1tuosq2ZM/g  1430512179696   0
n0w4uQhOuXymj5jLaCMQ    G+Oj6J9Q1nI1tuosq2ZM/g  1430512217124   0
n0w4uQhOuXymj5jLaCMQ    mqf38Xd6CAQtuvuKc5NlWQ  1430512179696   1

I know for a fact that rows are sorted first by user_id and then by event_id.

The question is: given that the rows are sorted, is there a way to "hint" the Hive engine to optimize the query? The goal is to avoid keeping all groups in memory, since it is only necessary to keep one group at a time.

Right now this query, running on a 6-node 16 GB Hadoop cluster with roughly 300 GB of data, takes about 30 minutes and uses most of the RAM, choking the system. I know that each group will be small, no more than 100 rows per (user_id, event_id) tuple, so I think an optimized execution would probably have a very small memory footprint and also be faster (since there is no need to look up group keys).

Recommended answer

Create a bucketed sorted table. The optimizer will then know from the table metadata that the data is sorted. See the example in the official docs: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-BucketedSortedTables
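As a rough sketch of what such a table could look like for this data: the table name events_sorted, the column types, and the bucket count of 32 are assumptions for illustration, not details given in the question. The bucketing and sort columns are chosen to match the GROUP BY keys.

```sql
-- Hypothetical bucketed, sorted copy of events_all.
-- Column types and the bucket count are assumptions; adjust to the real data.
CREATE TABLE events_sorted (
  user_id     STRING,
  event_id    STRING,
  `time`      BIGINT,
  interaction INT
)
CLUSTERED BY (user_id, event_id)
SORTED BY (user_id, event_id)
INTO 32 BUCKETS;

-- On Hive versions before 2.0 these settings are needed so that inserts
-- actually honor the declared bucketing and sort order.
SET hive.enforce.bucketing = true;
SET hive.enforce.sorting = true;

-- Populate the new table from the existing one.
INSERT OVERWRITE TABLE events_sorted
SELECT user_id, event_id, `time`, interaction
FROM events_all;
```

Depending on the Hive version, the setting hive.map.groupby.sorted may also need to be enabled for the planner to take advantage of the sort order when the grouping keys match the table's bucketing and sorting columns; running EXPLAIN on the query is a reasonable way to check what plan is actually chosen.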

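To count only the rows where interaction = 1, use count(case when interaction=1 then 1 end) as clicks. The original count(interaction == 1) counts every row with a non-NULL interaction, because the comparison yields true or false and count only skips NULLs. The case expression yields 1 for matching rows and NULL for all others, so only the matching rows are counted.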
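Applied to the query from the question, the corrected version might look like the following. The aliases start_time and end_time are a renaming introduced here to avoid clashing with keywords; everything else follows the original query.

```sql
-- Same aggregation as in the question, with the click count fixed.
SELECT user_id,
       event_id,
       min(`time`) AS start_time,
       max(`time`) AS end_time,
       count(*)    AS total,
       -- CASE yields NULL when interaction <> 1, and count() ignores NULLs
       count(CASE WHEN interaction = 1 THEN 1 END) AS clicks
FROM events_all
GROUP BY user_id, event_id;
```

For the five sample rows shown in the question, this would give clicks = 1 for the first (user_id, event_id) pair, 0 for the second, and 1 for the third, whereas the original expression would report 2, 2, and 1.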
