More efficient query to avoid OutOfMemoryError in Hive


Problem Description

I'm getting an exception in Hive:

java.lang.OutOfMemoryError: GC overhead limit exceeded.

In searching I've found that this happens because 98% of the process's CPU time is going to garbage collection (whatever that means?). Is the core of my issue in my query? Should I be writing the below in a different way to avoid this kind of problem?

I'm trying to count how many of a certain phone type have an active 'Use' in a given time period. Is there a way to do this logic differently that would run better?

select count(a.imei)
from
  (select distinct imei
   from pingdata
   where timestamp between TO_DATE("2016-06-01") and TO_DATE("2016-07-17")
     and (SUBSTR(imei,12,2) = "04" or SUBSTR(imei,12,2) = "05")) a
join
  (select distinct imei
   from eventdata
   where timestamp between TO_DATE("2016-06-01") and TO_DATE("2016-07-17")
     and event = "Use" and clientversion like '3.2%') b
on a.imei = b.imei

Thanks

Recommended Answer

Applying distinct to each dataset before joining them is safer, because joining on non-unique keys will duplicate data.
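For intuition, here is a contrast sketch of the same query without distinct (not the recommended form, only to illustrate the duplication): if an imei matches several rows on each side, the join multiplies them and count(a.imei) overcounts.

-- Contrast sketch only: the same join WITHOUT distinct (not recommended).
-- If imei 'X' matches 3 rows in pingdata and 2 rows in eventdata, the join
-- emits 3 x 2 = 6 rows for 'X', so count(a.imei) counts it 6 times.
select count(a.imei)
from
  (select imei  -- no distinct: duplicate imeis survive
   from pingdata
   where timestamp between TO_DATE("2016-06-01") and TO_DATE("2016-07-17")
     and (SUBSTR(imei,12,2) = "04" or SUBSTR(imei,12,2) = "05")) a
join
  (select imei  -- no distinct: duplicate imeis survive
   from eventdata
   where timestamp between TO_DATE("2016-06-01") and TO_DATE("2016-07-17")
     and event = "Use" and clientversion like '3.2%') b
on a.imei = b.imei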

I would recommend partitioning your datasets by a to_date(timestamp) field (yyyy-MM-dd) so that partition pruning works with your where clause (check that it does). Also partition by the event field if the datasets are very large and contain a lot of data where event <> 'Use'.
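For example, a minimal sketch of what such a layout could look like for eventdata; the table name eventdata_part and the exact column list are assumptions, not the poster's schema:

-- Hypothetical partitioned copy of eventdata (name and columns are assumptions).
create table eventdata_part (
  imei          string,
  clientversion string,
  `timestamp`   timestamp
)
partitioned by (dt string, event string)
stored as orc;

-- Dynamic partitioning so each (dt, event) pair becomes its own partition.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table eventdata_part partition (dt, event)
select imei, clientversion, `timestamp`,
       to_date(`timestamp`) as dt,
       event
from eventdata;

-- A filter such as dt between '2016-06-01' and '2016-07-17' and event = 'Use'
-- then prunes partitions instead of scanning the whole table.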

It's important to know at which stage it fails; study the exception as well. If it fails on the mappers, then you should optimize your subqueries (add the partitions I mentioned). If it fails on the reducers (the join), then you should somehow improve the join, e.g. try to reduce bytes per reducer:

set hive.exec.reducers.bytes.per.reducer=67108864;

or even less. If it fails on the writer (OrcWriter), then try adding a partition to the output table by a substr of imei, and add `distribute by substr(imei, ...)` at the end of the query to reduce pressure on the reducers.

Or add one more column with low cardinality and even distribution to spread the data across more reducers evenly:

distribute by substr(imei...), col2

Make sure that the partition column is in the distribute by. This will reduce the number of files written by each reducer and help to get rid of the OOM.
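Putting it together, here is a rough end-to-end sketch; the partitioned source tables (pingdata_part / eventdata_part with a dt partition column), the output table active_use_imeis, and its imei_prefix partition are all assumptions used for illustration, not the poster's schema:

-- Hypothetical end-to-end sketch; all table/column names below are assumptions.
set hive.exec.reducers.bytes.per.reducer=67108864;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Output table partitioned by a substr of imei, as suggested for writer OOMs.
create table active_use_imeis (imei string)
partitioned by (imei_prefix string)
stored as orc;

insert overwrite table active_use_imeis partition (imei_prefix)
select a.imei, substr(a.imei, 12, 2) as imei_prefix
from
  (select distinct imei
   from pingdata_part
   where dt between '2016-06-01' and '2016-07-17'   -- partition pruning
     and substr(imei, 12, 2) in ('04', '05')) a
join
  (select distinct imei
   from eventdata_part
   where dt between '2016-06-01' and '2016-07-17'   -- partition pruning
     and event = 'Use' and clientversion like '3.2%') b
on a.imei = b.imei
-- keep the output partition column in the distribute by; add another
-- low-cardinality, evenly distributed column here if reducers are still skewed
distribute by substr(a.imei, 12, 2);

select count(*) from active_use_imeis;

Whether the intermediate table is worth it depends on whether the failure really is in the writer; if the job dies earlier, the partition pruning alone may already be enough.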
