蜂巢左外连接长期运行 [英] hive left outer join long running

查看:44
本文介绍了蜂巢左外连接长期运行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

Hortonworks HDP 2.3.0 - Hive 0.14

Hortonworks HDP 2.3.0 - Hive 0.14

Table T1(在 col1 上分区,无桶,ORC) app 1.2 亿行 &6GB 数据大小Table T2(col2 上的分区,无桶,ORC) app 200 M 行 &6MB 数据大小

Table T1 ( partition on col1, no bucket, ORC ) app 120 million rows & 6GB datasize Table T2 ( partition on col2, no bucket, ORC ) app 200 M rows & 6MB datasize

T1 left outer join on t2 ( t1.col3 = t2.col3 )

上述查询在 tez 和 tez 的最后一个减速器阶段长时间运行.先生模式.我也试过自动转换真/假&显式映射连接.

The above query is long running in the last reducer phase in both tez & mr mode. I also tried auto convert true / false & explicit mapjoin.

查询仍然在最后一个 reducer 阶段运行,永远不会结束.

Still the query is running in the last reducer phase, never ending.

仅供参考 - 如果 T2 的数据大小为 9k 或 1GB,则查询完成.

FYI - If data size of T2 is either 9k or 1GB, the query finishes.

推荐答案

问题可能是每个 reducer 有太多字节/行.如果应用程序执行卡在最后一个单个reducer 中,那么很可能是数据倾斜.要检查它,请从两个表中选择前 5 个 col3,偏斜是指有很多具有相同键值的记录(比如 30%).如果是偏斜,则尝试单独加入偏斜键,然后将 UNION ALL 与所有其他键连接.像这样:

The problem maybe is that there are too many bytes/rows per reducer. If the application execution is stuck in the last single reducer then it's most probably data skew. To check it, select top 5 col3 from both tables, skew is when there are a lot of records with the same key value(say 30%). If it's a skew then try to join separately skew key then UNION ALL with all other keys join. Something like this:

select * from
T1 left outer join on t2 on ( t1.col3 = t2.col3 ) and t1.col3=SKEW_VALUE
union all
select * from
T1 left outer join on t2 on ( t1.col3 = t2.col3 ) and t1.col3<>SKEW_VALUE 

如果应用程序执行卡在最后一个 reducer 阶段,而不是一个 reducer 或几个 reducer,那么检查 bytes.per.reducer hive 设置,可能是太高了.

If the application execution is stuck in the last reducer stage, not a single reducer or few reducers, then check bytes.per.reducer hive setting, maybe it's too high.

set hive.exec.reducers.bytes.per.reducer=67108864;

这篇关于蜂巢左外连接长期运行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆