Hive cluster by vs order by vs sort by


Question

As far as I understand:

  • sort by only sorts within each reducer

  • order by orders things globally but shoves everything into one reducer

  • cluster by distributes rows into reducers by the key hash and then does a sort by

So my question is: does cluster by guarantee a global order? distribute by puts the same keys into the same reducers, but what about adjacent keys?

The only document I can find on this is here, and from the example it seems like it orders them globally. But from the definition I feel like it doesn't always do that.
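
The concern about adjacent keys can be made concrete with a minimal sketch. It assumes a hash-style partitioner modelled as `key % num_reducers` for integer keys; the function name `reducer_for` and the sample keys are made up for illustration, and Hive's actual hashing may differ in detail.

```python
# Sketch: hash-style distribution, modelled (as an assumption) as key % N.
from collections import defaultdict


def reducer_for(key: int, num_reducers: int) -> int:
    """Hypothetical partitioner: identical keys always map to the same reducer."""
    return key % num_reducers


keys = [1, 2, 4, 3, 1]
buckets = defaultdict(list)
for k in keys:
    buckets[reducer_for(k, 2)].append(k)

# Both copies of key 1 share a reducer, but the adjacent keys 1 and 2 do not.
for r in sorted(buckets):
    print(f"reducer {r}: {buckets[r]}")
```

Under this assumption, equal keys stay together while neighbouring keys are scattered, which is why a hash-based distribution by itself says nothing about ranges.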

Answer

Short answer: not in general. CLUSTER BY guarantees that each reducer's output is sorted and that all rows with the same key land in the same reducer, but it does not guarantee a global order across the output files.

The longer version:

  • ORDER BY x: guarantees global ordering, but does this by pushing all data through just one reducer. This is basically unacceptable for large datasets. You end up with one sorted file as output.
  • SORT BY x: orders data at each of N reducers, but each reducer can receive overlapping ranges of data. You end up with N or more sorted files with overlapping ranges.
  • DISTRIBUTE BY x: ensures that all rows with the same value of x go to the same one of N reducers, but doesn't sort the output of each reducer. Because rows are assigned by a hash of x rather than by range, adjacent keys can land in different reducers. You end up with N or more unsorted files whose key sets are disjoint but whose ranges can overlap.
  • CLUSTER BY x: the same as doing DISTRIBUTE BY x followed by SORT BY x. Each reducer gets a disjoint set of keys and sorts its rows by x. You end up with N or more sorted files with disjoint key sets, but the ranges can still overlap, so concatenating the files does not give a globally sorted result. You would have to merge them yourself.

Make sense? So CLUSTER BY scales better than ORDER BY, but it is not a drop-in replacement when you need a total order.
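
To see what each clause does and does not guarantee, here is a toy model in Python where each inner list stands for one reducer's output file. It is only a sketch under stated assumptions: distribution is modelled as `x % n` (a hash-style partitioner), SORT BY alone is modelled as a round-robin spread, and all function names are invented for illustration. None of this is Hive's actual implementation.

```python
# Toy model: each inner list is one reducer's output file.
# Assumptions: DISTRIBUTE BY is modelled as x % n (hash-style), and the
# row spread for a bare SORT BY is modelled as round-robin.


def order_by(rows):
    return [sorted(rows)]  # single reducer, single sorted file


def sort_by(rows, n):
    files = [rows[i::n] for i in range(n)]  # arbitrary spread of rows
    return [sorted(f) for f in files]


def distribute_by(rows, n):
    files = [[] for _ in range(n)]
    for x in rows:
        files[x % n].append(x)  # equal keys always share a file
    return files


def cluster_by(rows, n):
    # Modelled as DISTRIBUTE BY followed by SORT BY on the same key.
    return [sorted(f) for f in distribute_by(rows, n)]


def each_file_sorted(files):
    return all(f == sorted(f) for f in files)


def keys_disjoint(files):
    seen = set()
    for f in files:
        if seen & set(f):
            return False
        seen |= set(f)
    return True


def globally_sorted(files):
    # True only if plain concatenation of the files is already in order.
    joined = [x for f in files for x in f]
    return joined == sorted(joined)


rows = [5, 3, 8, 1, 3, 6, 2, 8, 4]
results = {
    "ORDER BY": order_by(rows),
    "SORT BY": sort_by(rows, 3),
    "DISTRIBUTE BY": distribute_by(rows, 3),
    "CLUSTER BY": cluster_by(rows, 3),
}
for name, files in results.items():
    print(name, files, each_file_sorted(files),
          keys_disjoint(files), globally_sorted(files))
```

In this model the CLUSTER BY files are individually sorted and share no keys, yet concatenating them is not in order, so a merge step over the sorted files would still be needed to obtain a total order.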
