Hive cluster by vs order by vs sort by

Question

As far as I understand:

  • sort by only sorts within each reducer
  • order by orders things globally but shoves everything into one reducer
  • cluster by distributes rows to reducers by a hash of the key and then does a sort by

So my question is: does cluster by guarantee a global order? distribute by puts the same keys into the same reducer, but what about adjacent keys?

The only document I can find on this is here, and from the example it seems like it orders them globally. But from the definition I feel like it doesn't always do that.

Recommended answer

Short answer: not by itself. CLUSTER BY sorts the rows inside each reducer and sends equal keys to the same reducer, but you only get a global order if you are willing to merge the multiple output files yourself.

Longer version:

  • ORDER BY x: guarantees global ordering, but does this by pushing all data through just one reducer. This is basically unacceptable for large datasets. You end up with one sorted file as output.
  • SORT BY x: orders data at each of N reducers, but each reducer can receive overlapping ranges of data. You end up with N or more sorted files with overlapping ranges.
  • DISTRIBUTE BY x: ensures all rows with the same x go to the same reducer, so the N reducers get non-overlapping sets of keys, but doesn't sort the output of each reducer. Rows are assigned by a hash of x, not by range, so adjacent keys can land on different reducers. You end up with N or more unsorted files whose key sets don't overlap but whose ranges can.
  • CLUSTER BY x: the same as doing DISTRIBUTE BY x followed by SORT BY x. Each of N reducers gets a non-overlapping set of keys and sorts it. You end up with N or more sorted files. No key appears in more than one file, but the ranges can overlap, so simply concatenating the files does not give a global order. Merging the sorted files does.

Make sense? So CLUSTER BY scales better than ORDER BY, but it gives up the single globally sorted output.
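To see how the four clauses behave, here is a small Python sketch. It is a simulation, not Hive code: it assumes rows reach reducers via `hash(key) % n` (the question describes cluster by as distributing by key hash), it uses round-robin as a stand-in for "no particular distribution" in SORT BY, and all function names are made up for illustration.

```python
import heapq


def order_by(rows):
    # One reducer sees every row, so there is a single globally sorted file.
    return [sorted(rows)]


def sort_by(rows, n):
    # Rows arrive at reducers without regard to the key (round-robin here,
    # as an assumption); each reducer sorts only what it received.
    return [sorted(rows[i::n]) for i in range(n)]


def distribute_by(rows, n):
    # Equal keys always share a reducer, but nothing is sorted.
    # hash(key) % n is an illustrative stand-in for a hash partitioner.
    files = [[] for _ in range(n)]
    for key in rows:
        files[hash(key) % n].append(key)
    return files


def cluster_by(rows, n):
    # DISTRIBUTE BY followed by SORT BY on the same key.
    return [sorted(f) for f in distribute_by(rows, n)]


rows = [5, 3, 8, 1, 9, 2, 8, 4, 7, 6, 3]
files = cluster_by(rows, 3)

# Reading the files one after another is not a global order...
concatenated = [key for f in files for key in f]
# ...but a k-way merge of the already-sorted files is.
merged = list(heapq.merge(*files))

print("files:       ", files)
print("concatenated:", concatenated)
print("merged:      ", merged)
```

With these sample keys, each simulated file is sorted and no key appears in two files, yet the concatenation is out of order because neighbouring keys such as 3 and 4 hash to different reducers. The `heapq.merge` step is the "merge the files yourself" part.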
