Where is sort used in the MapReduce phase, and why?


Question

I am new to Hadoop. It is not clear to me why we need to sort by key when using Hadoop MapReduce. After the map phase, we need to distribute the data for each unique key across some number of reducers. Can't this be done without sorting?

Answer

It is there because sorting is a neat trick for grouping keys. Of course, if your job or algorithm does not need the keys in any particular order, grouping with a hash-based approach will be faster.
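As a rough illustration of the difference, here is a minimal sketch in plain Python. It is not Hadoop code, and the function names and sample data are made up for this example. Both approaches produce the same groups, but only the sort-based one also yields the keys in increasing order:

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def group_by_sort(pairs):
    """Group values by key by sorting first, then merging runs of equal keys."""
    ordered = sorted(pairs, key=itemgetter(0))  # O(n log n), stable sort
    return [
        (key, [value for _, value in run])
        for key, run in groupby(ordered, key=itemgetter(0))
    ]


def group_by_hash(pairs):
    """Group values by key with a hash table; key order is not part of the contract."""
    groups = defaultdict(list)
    for key, value in pairs:  # O(n) expected
        groups[key].append(value)
    return dict(groups)


# Hypothetical intermediate key/value pairs emitted by mappers.
pairs = [("b", 1), ("a", 2), ("b", 3), ("c", 4), ("a", 5)]
sorted_groups = group_by_sort(pairs)
hashed_groups = group_by_hash(pairs)
```

The hash-based version skips the O(n log n) sort, which is the saving the answer refers to, at the cost of losing any ordering guarantee on the keys.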

Hadoop itself has had a JIRA issue open for this for years (source). Several distributions that build on top of Hadoop already offer this feature; Hanborq, for example, calls it sort avoidance (source).

As for your actual question (why): MapReduce originates from a Google paper (source), which states the following:

We guarantee that within a given partition, the intermediate key/value pairs are processed in increasing key order. This ordering guarantee makes it easy to generate a sorted output file per partition, which is useful when the output file format needs to support efficient random access lookups by key, or users of the output find it convenient to have the data sorted.

So supporting sort was more of a convenience decision; nothing inherently requires that keys be grouped by sorting.
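To see why a sorted output file per partition helps with lookups by key, consider this small Python sketch. It assumes the partition's output fits in an in-memory list of (key, value) records already sorted by key; the `lookup` helper and the sample data are hypothetical, and this is not how any particular Hadoop file format is implemented:

```python
from bisect import bisect_left


def lookup(sorted_records, key):
    """Binary-search a key-sorted list of (key, value) records; None if absent."""
    # (key,) sorts just before any (key, value) tuple with the same key,
    # so bisect_left lands on the first record with that key in O(log n).
    i = bisect_left(sorted_records, (key,))
    if i < len(sorted_records) and sorted_records[i][0] == key:
        return sorted_records[i][1]
    return None


# Hypothetical reducer output for one partition, in increasing key order.
partition_output = [("apple", 3), ("banana", 7), ("cherry", 1)]
banana_count = lookup(partition_output, "banana")
missing = lookup(partition_output, "durian")
```

Without the ordering guarantee, the same lookup would need either a full scan or a separately built index.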
