Hive: Does the order of the data records matter when joining tables?


Question

I would like to know whether the order of the data records matters (performance-wise) when joining two tables.
P.S. I am not using any map-side join or bucket join.

Thanks!

Recommended answer

On the one hand, order should not matter. During a shuffle join, mappers read the files in parallel: a file may be split across several mappers, or one mapper may read several files, and each mapper's output is then passed to the reducers. So even if the data was sorted, parallelism means it is not read and distributed in that order.
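
The point about parallel reads can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not Hive's actual shuffle implementation: the function name `route_to_reducers`, the integer join keys, and the modulo "hash" are all made up for illustration. It shows that each reducer ends up with the same set of rows whether the input arrives sorted or in random order.

```python
import random
from collections import defaultdict


def route_to_reducers(records, num_reducers):
    """Toy shuffle: send each (key, value) row to a reducer chosen from its join key."""
    buckets = defaultdict(list)
    for key, value in records:
        # Stand-in for the partitioner: integer key modulo the reducer count.
        buckets[key % num_reducers].append((key, value))
    # The shuffle phase delivers rows to each reducer grouped by key,
    # so the order in which mappers emitted them does not survive.
    return {reducer: sorted(rows) for reducer, rows in buckets.items()}


# The same 1000 rows, once in key order and once in random order.
sorted_rows = [(k, f"v{k}") for k in range(1000)]
shuffled_rows = sorted_rows[:]
random.Random(0).shuffle(shuffled_rows)

same_reducer_input = (
    route_to_reducers(sorted_rows, 4) == route_to_reducers(shuffled_rows, 4)
)
print(same_reducer_input)
```

In this model the reducers cannot tell the two inputs apart, which is why record order by itself is not expected to change how the join is executed.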

On the other hand, sorting can improve compression, depending on the data's entropy, because similar values stored next to each other compress better. Sorted, compressed files are therefore smaller and are read faster during join query execution. This may improve join speed: mappers read the data faster, and the internal indexes in ORC work efficiently if the data was sorted by the filter columns during load and predicate pushdown (PPD) is enabled. Sorted, compressed files can be 3 times smaller or even more, which in turn results in about 3 times fewer mappers.
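
The compression effect is easy to check outside Hive. The sketch below is an assumption-laden illustration, not a measurement of ORC itself: it builds a hypothetical low-cardinality column (50 distinct values, 100,000 rows) and uses Python's `zlib` as a stand-in for whatever codec the table uses, then compares the compressed size of the column in random order and in sorted order.

```python
import random
import zlib

rng = random.Random(42)

# Hypothetical column: 100,000 rows drawn from 50 distinct category values.
values = [f"category_{rng.randrange(50):02d}" for _ in range(100_000)]


def compressed_size(column):
    """Return the zlib-compressed size in bytes of a newline-joined column."""
    return len(zlib.compress("\n".join(column).encode("utf-8")))


unsorted_size = compressed_size(values)
sorted_size = compressed_size(sorted(values))

print("unsorted:", unsorted_size, "bytes")
print("sorted:  ", sorted_size, "bytes")
```

How large the gain is depends on the data: a column with many repeated values benefits a lot from sorting, while a column of near-unique values may barely change.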

Sorting pays off when you write and sort the data once and read it many times.
