Hive: does the order of the data records matter when joining tables?

Question

I would like to know whether the order of the data records matters (performance-wise) when joining two tables.
P.S. I am not using a map-side join or a bucket join.

Thanks!

Recommended answer

On the one hand, the order should not matter. During a shuffle join the files are read by mappers in parallel: a file may be split across several mappers or, conversely, one mapper may read several files, and the mappers' output is then passed to the reducers. So even if the data was ordered, it is not read and distributed in that order, because of the parallelism.

On the other hand, ordering the data may improve compression, depending on the data's entropy, because similar values stored next to each other compress better. Sorted, compressed files can therefore be smaller and will be read faster during join execution, which may speed up the join because the mappers spend less time reading. ORC indexes may also filter more efficiently if the data was sorted during load. How much this helps depends on your data's entropy and on the filters you are using.
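
The two effects described above (better compression and more selective index filtering on sorted data) can be illustrated with a small, self-contained Python sketch. It does not use Hive or ORC at all: `zlib` stands in for a file codec, and the min/max check per fixed-size chunk is only a rough analogy for index-based skipping. The `ROW_GROUP` size, the key range, and the helper names are assumptions made up for this illustration.

```python
import random
import zlib

ROW_GROUP = 10_000  # hypothetical chunk size, chosen only for this illustration


def compressed_size(values):
    """Return the zlib-compressed size in bytes of the values, one per line."""
    payload = "\n".join(str(v) for v in values).encode("utf-8")
    return len(zlib.compress(payload))


def groups_to_read(values, target):
    """Count chunks whose [min, max] range could contain `target`.

    A chunk whose range excludes `target` could be skipped by a reader
    that keeps min/max statistics per chunk.
    """
    count = 0
    for start in range(0, len(values), ROW_GROUP):
        chunk = values[start:start + ROW_GROUP]
        if min(chunk) <= target <= max(chunk):
            count += 1
    return count


# Synthetic join-key column: 100,000 rows, keys drawn from 0..999.
rng = random.Random(42)
unsorted_keys = [rng.randrange(1000) for _ in range(100_000)]
sorted_keys = sorted(unsorted_keys)

unsorted_size = compressed_size(unsorted_keys)
sorted_size = compressed_size(sorted_keys)
unsorted_groups = groups_to_read(unsorted_keys, 500)
sorted_groups = groups_to_read(sorted_keys, 500)

print(f"compressed bytes: unsorted={unsorted_size}, sorted={sorted_size}")
print(f"chunks to read for key = 500: unsorted={unsorted_groups}, sorted={sorted_groups}")
```

On this synthetic column the sorted copy compresses to fewer bytes, and a filter on a single key touches fewer chunks. Real tables with high-entropy columns or different filters may show a much smaller difference, which is the caveat the answer ends on.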
