Hive cluster by vs order by vs sort by
As far as I understand:
- sort by only sorts within each reducer
- order by orders things globally, but shoves everything into one reducer
- cluster by intelligently distributes rows into reducers by the key hash and then applies a sort by
So my question is: does cluster by guarantee a global order? distribute by puts the same keys into the same reducer, but what about adjacent keys?
The only document I can find on this is here, and from the example it seems like it orders them globally. But from the definition, I feel like it doesn't always do that.
A short answer: yes, CLUSTER BY guarantees global ordering, provided you're willing to join the multiple output files yourself.
The longer version:
- ORDER BY x: guarantees global ordering, but does this by pushing all data through just one reducer. This is basically unacceptable for large datasets. You end up with one sorted file as output.
- SORT BY x: orders data at each of N reducers, but each reducer can receive overlapping ranges of data. You end up with N or more sorted files with overlapping ranges.
- DISTRIBUTE BY x: ensures each of N reducers gets non-overlapping ranges of x, but doesn't sort the output of each reducer. You end up with N or more unsorted files with non-overlapping ranges.
- CLUSTER BY x: ensures each of N reducers gets non-overlapping ranges, then sorts the data within those ranges at each reducer. This gives you global ordering, and is the same as doing (DISTRIBUTE BY x and SORT BY x). You end up with N or more sorted files with non-overlapping ranges.
Make sense? So CLUSTER BY is basically the more scalable version of ORDER BY.
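If it helps, the four behaviours can be sketched as a toy Python simulation. This is not Hive code: the function names, the reducer count `N_REDUCERS`, the random spreading used for SORT BY, and the `key % n` stand-in for the key hash mentioned in the question are all assumptions made for illustration. Each inner list plays the role of one reducer's output file.

```python
import heapq
import random

N_REDUCERS = 3  # hypothetical reducer count, chosen for illustration


def order_by(rows):
    # A single reducer receives every row: one fully sorted output file.
    return [sorted(rows)]


def sort_by(rows, n=N_REDUCERS, seed=0):
    # Rows are spread over reducers without regard to the key (random here,
    # as an assumption); each reducer sorts only what it received.
    rng = random.Random(seed)
    files = [[] for _ in range(n)]
    for row in rows:
        files[rng.randrange(n)].append(row)
    return [sorted(f) for f in files]


def distribute_by(rows, n=N_REDUCERS):
    # Equal keys always reach the same reducer; nothing is sorted.
    # `row % n` is a stand-in for hashing the key (assumption).
    files = [[] for _ in range(n)]
    for row in rows:
        files[row % n].append(row)
    return files


def cluster_by(rows, n=N_REDUCERS):
    # DISTRIBUTE BY followed by a per-reducer SORT BY.
    return [sorted(f) for f in distribute_by(rows, n)]


rows = [7, 3, 9, 1, 3, 8, 2, 7, 5, 4]

clustered = cluster_by(rows)
print(clustered)  # [[3, 3, 9], [1, 4, 7, 7], [2, 5, 8]]

# "Joining the output files yourself": a k-way merge of the sorted files.
merged = list(heapq.merge(*clustered))
print(merged)  # [1, 2, 3, 3, 4, 5, 7, 7, 8, 9]
```

In this sketch every key lands in exactly one file and each file is sorted, so merging the files recovers the full order. Note that because the sketch assigns keys by hash, as the question describes, the files hold disjoint sets of keys rather than contiguous ranges; simply concatenating them would only be globally ordered under a range-based assignment.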