What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?


Question



In Map Reduce programming, the reduce phase has shuffling, sorting, and reduce as its sub-parts. Sorting is a costly affair.

What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?

Solution

First of all, shuffling is the process of transferring data from the mappers to the reducers, so I think it is obvious that it is necessary for the reducers, since otherwise they wouldn't be able to have any input (or input from every mapper). Shuffling can start even before the map phase has finished, to save some time. That's why you can see a reduce status greater than 0% (but less than 33%) when the map status is not yet 100%.
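The data flow above can be sketched with a toy word count. This is an illustrative Python simulation, not the Hadoop API; the function names and data are made up. It shows why a reducer needs input from every mapper: a single key's values can come from several of them.

```python
from collections import defaultdict

# Two toy mappers, each emitting (word, 1) pairs from its own input split.
def mapper(split):
    return [(word, 1) for word in split.split()]

map_outputs = [mapper("a b a"), mapper("b c")]

# Shuffle: transfer every mapper's output to the reduce side.
# Note that the values for key 'b' come from BOTH mappers, which is
# why the reducer cannot run correctly on a partial shuffle.
shuffled = defaultdict(list)
for output in map_outputs:
    for key, value in output:
        shuffled[key].append(value)

# Reduce: sum the counts for each key.
counts = {key: sum(values) for key, values in shuffled.items()}
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```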

Sorting saves time for the reducer, helping it easily distinguish when a new reduce task should start. To put it simply, it starts a new reduce task when the next key in the sorted input data is different from the previous one. Each reduce task takes a list of key-value pairs, but it has to call the reduce() method, which takes a key and a list of values as input, so it has to group values by key. This is easy to do if the input data is pre-sorted (locally) in the map phase and simply merge-sorted in the reduce phase (since the reducers get data from many mappers).
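A minimal sketch of that idea, again as a Python simulation rather than real Hadoop code: each mapper sorts its output locally, the reducer merge-sorts the already-sorted streams, and a new reduce call starts exactly when the key changes.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Each mapper sorts its own output locally before the shuffle.
mapper_outputs = [
    sorted([("b", 1), ("a", 1), ("a", 1)]),
    sorted([("c", 1), ("b", 1)]),
]

# Merge-sort the pre-sorted streams (cheap compared to a full sort
# of all the data at once).
merged = heapq.merge(*mapper_outputs, key=itemgetter(0))

def reduce_fn(key, values):
    # Stand-in for the user's reduce(key, list(values)) method.
    return key, sum(values)

# groupby starts a new group (i.e. a new reduce call) whenever the
# key differs from the previous one -- exactly what sorting enables.
results = [reduce_fn(key, (v for _, v in group))
           for key, group in groupby(merged, key=itemgetter(0))]
print(results)  # [('a', 2), ('b', 2), ('c', 1)]
```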

Partitioning, which you mentioned in one of the answers, is a different process. It determines to which reducer a (key, value) pair, output by the map phase, will be sent. The default Partitioner uses a hash on the keys to distribute them to the reduce tasks, but you can override it and use your own custom Partitioner.
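The partitioning step can be sketched like this (a hedged Python illustration: Hadoop's real HashPartitioner uses Java's key.hashCode() modulo the number of reduce tasks; crc32 stands in here only to keep the example deterministic):

```python
import zlib

NUM_REDUCERS = 3

def default_partition(key, num_reducers=NUM_REDUCERS):
    # Mimics the default hash partitioner: hash of the key modulo
    # the number of reduce tasks.
    return zlib.crc32(key.encode()) % num_reducers

def custom_partition(key, num_reducers=NUM_REDUCERS):
    # A custom Partitioner can override the default, e.g. routing
    # keys by their first letter so related keys land together.
    return 0 if key[0] < "m" else num_reducers - 1

for key in ["apple", "mango", "zebra"]:
    print(key, "-> reducer", custom_partition(key))
```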

A great source of information for these steps is this Yahoo tutorial.

A nice graphical representation of this is the following (shuffle is called "copy" in this figure):

Note that shuffling and sorting are not performed at all if you specify zero reducers (setNumReduceTasks(0)). In that case, the MapReduce job stops at the map phase, and the map phase does not include any kind of sorting (so even the map phase is faster).

UPDATE: Since you are looking for something more official, you can also read Tom White's book "Hadoop: The Definitive Guide". Here is the interesting part for your question.
Tom White has been an Apache Hadoop committer since February 2007, and is a member of the Apache Software Foundation, so I guess it is pretty credible and official...
