Where is Sort used in the MapReduce phase, and why?


Question

I am new to Hadoop. It is not clear to me why Hadoop MapReduce needs to sort by key. After the map phase, the data for each unique key has to be distributed to some number of reducers. Couldn't that be done without sorting?

Recommended answer

Sorting is there because it is a neat trick for grouping your keys: once the intermediate pairs are sorted, all values for the same key sit next to each other. Of course, if your job or algorithm does not need the keys in any particular order, grouping with a hashing scheme will be faster.
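To make the two grouping strategies concrete, here is a minimal Python sketch. It is not Hadoop code; the function names group_by_sort and group_by_hash and the sample pairs are made up for illustration. It only shows that both approaches produce the same groups, and that only the sort-based one yields the keys in order.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def group_by_sort(pairs):
    """Group values by key by sorting first (hypothetical helper).

    After the sort, equal keys are adjacent, so one linear pass collects
    each group. The groups come out in increasing key order.
    """
    ordered = sorted(pairs, key=itemgetter(0))  # stable sort on the key only
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


def group_by_hash(pairs):
    """Group values by key with a hash table (hypothetical helper).

    No comparison of keys is needed, but the result follows first-seen
    order of the keys, not sorted order.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


# Illustrative intermediate (key, value) pairs, as a mapper might emit them.
pairs = [("b", 1), ("a", 2), ("b", 3), ("c", 4), ("a", 5)]

sorted_groups = group_by_sort(pairs)
hashed_groups = group_by_hash(pairs)
```

Both results contain the same groups; the difference is that sorted_groups lists the keys as a, b, c, while hashed_groups keeps them in the order they were first seen.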

In Hadoop itself, a JIRA issue requesting this has been open for years (source). Several distributions layered on top of Hadoop already offer it; Hanborq, for example, calls it sort avoidance (source).

As for your actual question (why): MapReduce originated in a paper from Google (source), which states the following:

We guarantee that within a given partition, the intermediate key/value pairs are processed in increasing key order. This ordering guarantee makes it easy to generate a sorted output file per partition, which is useful when the output file format needs to support efficient random access lookups by key, or users of the output find it convenient to have the data sorted.

So supporting sort was more of a convenience decision than an inherent requirement that keys be grouped only by sorting.
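The ordering guarantee quoted above can be pictured with a small Python sketch of the shuffle step. This is a simplified model under stated assumptions, not Hadoop's implementation: partition_for, shuffle and reduce_partition are hypothetical names, and the CRC32-based partitioning merely stands in for whatever hash partitioner a real framework uses.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def partition_for(key, num_reducers):
    """Pick a reducer for a key (assumed scheme: CRC32 of the key, modulo
    the reducer count). Equal keys always land in the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Route each (key, value) pair to its partition, then sort every
    partition by key so each reducer sees keys in increasing order."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition_for(key, num_reducers)].append((key, value))
    for part in partitions:
        part.sort(key=itemgetter(0))
    return partitions


def reduce_partition(part, reducer):
    """Apply reducer to the values of each key in an already sorted partition.

    Because the partition is sorted, the output is sorted by key as well.
    """
    return [(key, reducer(value for _, value in group))
            for key, group in groupby(part, key=itemgetter(0))]


# Illustrative word-count style input.
pairs = [("pear", 1), ("apple", 1), ("fig", 1), ("apple", 1), ("pear", 1)]
partitions = shuffle(pairs, num_reducers=2)
outputs = [reduce_partition(part, sum) for part in partitions]
```

Distribution to reducers relies only on the hash of the key. The per-partition sort is a separate step, and it is what makes each reducer's output come out ordered by key; dropping it and grouping with a hash table inside each partition would still give correct groups, just without that order.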
