More efficient query to avoid OutOfMemoryError in Hive


Problem description

I'm getting an exception in Hive:

java.lang.OutOfMemoryError: GC overhead limit exceeded.

In searching I've found that this is because 98% of all CPU time of the process is going to garbage collection (whatever that means?). Is the core of my issue in my query? Should I be writing the below in a different way to avoid this kind of problem?

I'm trying to count how many of a certain phone type have an active 'Use' in a given time period. Is there a way to do this logic differently that would run better?

select count(a.imei)
from
  (select distinct imei
   from pingdata
   where timestamp between to_date("2016-06-01") and to_date("2016-07-17")
     and (substr(imei,12,2) = "04" or substr(imei,12,2) = "05")) a
join
  (select distinct imei
   from eventdata
   where timestamp between to_date("2016-06-01") and to_date("2016-07-17")
     and event = "Use" and clientversion like '3.2%') b
on a.imei = b.imei

Thanks

Recommended answer

Applying distinct to each dataset before joining them is safer, because joining on non-unique keys will duplicate data.
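As a hypothetical illustration (not from the original post): if one imei occurs 3 times in pingdata and 4 times in eventdata, a plain join on imei produces 12 rows for that device, whereas the distinct subqueries reduce it to a single matched row.

-- joining the raw tables multiplies rows for every repeated imei (illustrative sketch)
select count(a.imei)
from pingdata a
join eventdata b on a.imei = b.imei;

-- applying distinct on each side first keeps one row per imei per table
select count(a.imei)
from (select distinct imei from pingdata) a
join (select distinct imei from eventdata) b on a.imei = b.imei;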

I would recommend partitioning your datasets by a to_date(timestamp) field (yyyy-MM-dd) so that partition pruning works according to your where clause (check that it does). Also partition by the event field if the datasets are too big and contain a lot of data where event <> 'Use'.
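A minimal sketch of what that layout could look like, assuming the column names from the query above (the table name eventdata_part, the dt partition column, and the rename of timestamp to ts are illustrative, not the asker's actual schema):

-- hypothetical partitioned copy of eventdata, partitioned by day and event
create table eventdata_part (
  imei string,
  clientversion string,
  ts timestamp
)
partitioned by (dt string, event string)
stored as orc;

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table eventdata_part partition (dt, event)
select imei, clientversion, `timestamp`, to_date(`timestamp`), event
from eventdata;

-- the where clause now prunes partitions instead of scanning the whole table
select distinct imei
from eventdata_part
where dt between '2016-06-01' and '2016-07-17'
  and event = 'Use' and clientversion like '3.2%';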

It's important to know at which stage it fails; study the exception as well. If it fails on the mappers, then you should optimize your subqueries (add partitions as I mentioned). If it fails on the reducers (the join), then you should somehow improve the join (try to reduce the bytes per reducer: set hive.exec.reducers.bytes.per.reducer=67108864; or even less). If it fails on the writer (OrcWriter), then try to add a partition to the output table by a substr of imei and `distribute by substr(imei...)` at the end of the query to reduce pressure on the reducers.
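For the reducer (join) case, a sketch of how the setting would be applied in the same session as the count query (the 64 MB value is the one from the answer and may need further tuning for your data):

-- smaller bytes-per-reducer => Hive launches more reducers for the join
set hive.exec.reducers.bytes.per.reducer=67108864;

select count(a.imei)
from (select distinct imei from pingdata
      where timestamp between to_date("2016-06-01") and to_date("2016-07-17")
        and substr(imei,12,2) in ("04","05")) a
join (select distinct imei from eventdata
      where timestamp between to_date("2016-06-01") and to_date("2016-07-17")
        and event = "Use" and clientversion like '3.2%') b
on a.imei = b.imei;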

Or add one more column with low cardinality and an even distribution to spread the data across more reducers evenly:

distribute by substr(imei...), col2

Make sure that the partition column is in the distribute by. This will reduce the number of files written by each reducer and help get rid of the OOM.
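Putting the writer-side pieces together, a hedged sketch of what that could look like (output_table, its imei_type partition column, and the use of the date as the second distribute-by column are assumptions for illustration, not the asker's actual tables):

-- hypothetical output table, partitioned by the two-digit phone-type code from the imei
create table output_table (
  imei string,
  dt string
)
partitioned by (imei_type string)
stored as orc;

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table output_table partition (imei_type)
select imei, to_date(`timestamp`) as dt, substr(imei,12,2) as imei_type
from pingdata
-- the partition column leads the distribute by; the date acts as the extra low-cardinality column
distribute by substr(imei,12,2), to_date(`timestamp`);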
