Performance tuning a Hive query


Problem description

I have a Hive query which is selecting about 30 columns and around 400,000 records and inserting them into another table. I have one join in my SQL clause, which is just an inner join.

The query fails with a Java "GC overhead limit exceeded" error.

What's strange is that if I remove the join clause and just select the data from the table (slightly higher volume) then the query works fine.

I'm pretty new to Hive. I can't understand why this join is causing memory exceptions.

Is there something that I should be aware of with regard to how I write Hive queries so that they don't cause these issues? Could anyone explain why the join might cause this issue when selecting a higher volume of data with the same number of columns does not?

Appreciate your thoughts on this. Thanks

Solution

Many thanks for the response, Mark. Much appreciated.

After many hours I eventually found out that the order of tables in the join statement makes a difference. For optimum performance and memory management, the largest table should be the last one in the join.

Changing the order of my tables in the join statement fixed the issue.

See Largest Table Last at http://hive.apache.org/docs/r0.9.0/language_manual/joins.html

Your explanation above is very useful as well. Many thanks.
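
For anyone wondering why the table order matters: as I understand the joins page of the manual linked above, in each map/reduce stage of a join the last table in the sequence is streamed through the reducers while the other tables are buffered in memory. The Python snippet below is only a toy model of that behaviour, not Hive's actual implementation. The function `reduce_side_join` and the tables `small` and `big` are made up for illustration.

```python
from collections import defaultdict


def reduce_side_join(tables):
    """Toy model of a reduce-side inner join on a shared key.

    tables: list of tables in join order, each a list of (key, value) rows.
    Every table except the last is buffered per key; the last is streamed.
    Returns (joined_rows, peak_rows_buffered_for_a_single_key).
    """
    *buffered, streamed = tables

    # Hold all rows of the non-final tables in memory, grouped by join key.
    buffers = [defaultdict(list) for _ in buffered]
    for buf, table in zip(buffers, buffered):
        for key, value in table:
            buf[key].append(value)

    # Largest number of rows that must be held at once for one key.
    keys = set().union(*(buf.keys() for buf in buffers))
    peak = max(
        (sum(len(buf.get(k, ())) for buf in buffers) for k in keys),
        default=0,
    )

    # Stream the last table row by row and match it against the buffers.
    joined = []
    for key, value in streamed:
        combos = [()]
        for buf in buffers:
            combos = [c + (v,) for c in combos for v in buf.get(key, ())]
        joined.extend((key, *c, value) for c in combos)
    return joined, peak


# Hypothetical data: 3 rows in the small table, 1000 in the big one,
# all sharing the same join key.
small = [("k", i) for i in range(3)]
big = [("k", i) for i in range(1000)]

rows_a, peak_a = reduce_side_join([small, big])  # largest table last
rows_b, peak_b = reduce_side_join([big, small])  # largest table first

print(len(rows_a), peak_a)  # 3000 3
print(len(rows_b), peak_b)  # 3000 1000
```

Both orderings produce the same number of joined rows, but with the big table first the model has to hold 1000 rows in memory for that key instead of 3. That is consistent with the GC overhead error going away once the largest table was moved to the end of the join.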
