What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?


Problem description


In Map Reduce programming, the reduce phase has shuffling, sorting and reduce as its sub-parts. Sorting is a costly affair.

What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?

Solution

First of all, shuffling is the process of transferring data from the mappers to the reducers, so I think it is obvious that it is necessary for the reducers, since otherwise they wouldn't be able to have any input (or input from every mapper). Shuffling can start even before the map phase has finished, to save some time. That's why you can see a reduce status greater than 0% (but less than 33%) while the map status is not yet 100%.
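How early shuffling starts is actually tunable. Below is a minimal sketch, assuming Hadoop 2.x (MRv2) and its mapreduce.job.reduce.slowstart.completedmaps property, which controls what fraction of map tasks must complete before reducers are scheduled (older MRv1 releases named it mapred.reduce.slowstart.completed.maps, and the class name here is just a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SlowStartExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Schedule reducers (and thus their copy/shuffle phase) once 50% of the
            // map tasks have finished, instead of the usual default of 5%.
            conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.50f);
            Job job = Job.getInstance(conf, "shuffle slow-start example");
            // ... mapper, reducer and input/output paths would be configured here ...
        }
    }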

Sorting saves time for the reducer, helping it easily distinguish when a new reduce task should start: put simply, it starts a new reduce task when the next key in the sorted input data is different from the previous one. Each reduce task takes a list of key-value pairs, but it has to call the reduce() method, which takes a key and a list of values as input, so it has to group the values by key. That is easy to do if the input data is pre-sorted (locally) in the map phase and simply merge-sorted in the reduce phase (since the reducers get data from many mappers).
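To make that key/list-of-values grouping concrete, here is a minimal reducer sketch using the standard org.apache.hadoop.mapreduce API (the word-count style types and the SumReducer name are just an illustration): reduce() is invoked once per distinct key with an Iterable over all of its values, which the framework can provide cheaply only because the merged input arrives sorted by key.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Word-count style reducer: the framework calls reduce() once per distinct key,
    // passing every value that was grouped under that key by the sorted/merged input.
    public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {   // all values for this key, already grouped
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }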

Partitioning, which you mentioned in one of the answers, is a different process. It determines to which reducer a (key, value) pair, output of the map phase, will be sent. The default Partitioner uses a hash of the key to distribute the pairs to the reduce tasks, but you can override it and use your own custom Partitioner.
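For illustration, a custom Partitioner can be as simple as the sketch below (the FirstLetterPartitioner name and the route-by-first-character rule are made up for this example; the default HashPartitioner essentially computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Illustrative custom partitioner: sends keys to reducers based on their first
    // character instead of the default hash-based distribution.
    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            if (numReduceTasks == 0) {
                return 0;                      // map-only job, nothing to partition
            }
            String s = key.toString();
            char first = s.isEmpty() ? '\0' : Character.toLowerCase(s.charAt(0));
            // char values are non-negative, so the remainder is a valid partition index
            return first % numReduceTasks;
        }
    }

It would then be plugged into the job with job.setPartitionerClass(FirstLetterPartitioner.class).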

A great source of information for these steps is this Yahoo tutorial (archived).

(Figure: a graphical representation of these phases; shuffle is called "copy" in the figure.)

Note that shuffling and sorting are not performed at all if you specify zero reducers (setNumReduceTasks(0)). Then, the MapReduce job stops at the map phase, and the map phase does not include any kind of sorting (so even the map phase is faster).
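As a rough sketch of such a map-only job (the MapOnlyDriver and MyMapper names and the argument handling are placeholders for this example), the only essential line is setNumReduceTasks(0):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Map-only job: with zero reduce tasks there is no shuffle and no sort;
    // each mapper writes its output straight to the output directory.
    public class MapOnlyDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "map-only example");
            job.setJarByClass(MapOnlyDriver.class);
            job.setMapperClass(MyMapper.class);    // hypothetical mapper class
            job.setNumReduceTasks(0);              // skip shuffling, sorting and reducing
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }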

UPDATE: Since you are looking for something more official, you can also read Tom White's book "Hadoop: The Definitive Guide". Here is the interesting part for your question.
Tom White has been an Apache Hadoop committer since February 2007, and is a member of the Apache Software Foundation, so I guess it is pretty credible and official...
