What is the purpose of the shuffling and sorting phase in the reducer in MapReduce programming?

Question

In MapReduce programming, the reduce phase has shuffling, sorting and reduce as its sub-parts. Sorting is a costly affair.

What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?

Answer

First of all, shuffling is the process of transferring data from the mappers to the reducers, so I think it is obvious that the reducers need it: without it they would have no input at all (or would be missing the input from some mappers). Shuffling can start even before the map phase has finished, to save some time. That's why you can see a reduce status greater than 0% (but less than 33%) when the map status is not yet 100%: the first third of the reported reduce progress corresponds to the copy (shuffle) step.

Sorting saves time for the reducer, helping it easily distinguish when a new call to reduce() should start. To put it simply, it starts a new reduce() call whenever the next key in the sorted input data differs from the previous one. Each reduce task receives a list of key-value pairs, but it has to call the reduce() method, which takes a key and the list of values for that key, so it has to group values by key. That is easy to do if the input data is pre-sorted (locally) in the map phase and then simply merge-sorted in the reduce phase (since each reducer gets data from many mappers).
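The grouping step can be illustrated with a small Python sketch. This is a toy model and not Hadoop code: the names `mapper_outputs`, `merge_sorted_runs` and `run_reducer` are made up for illustration, and each mapper's output is assumed to be already sorted by key.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_sorted_runs(runs):
    """Merge several key-sorted lists of (key, value) pairs into one sorted stream."""
    return heapq.merge(*runs, key=itemgetter(0))


def run_reducer(runs, reduce_fn):
    """Call reduce_fn(key, values) once per distinct key in the merged stream."""
    results = []
    merged = merge_sorted_runs(runs)
    # groupby opens a new group each time the key differs from the previous one.
    # Because the stream is sorted, that yields exactly one group per key.
    for key, pairs in groupby(merged, key=itemgetter(0)):
        values = [value for _, value in pairs]
        results.append(reduce_fn(key, values))
    return results


# Hypothetical outputs of three mappers, each sorted locally by key.
mapper_outputs = [
    [("apple", 1), ("cherry", 1)],
    [("apple", 1), ("banana", 1)],
    [("banana", 1), ("cherry", 1), ("cherry", 1)],
]

word_counts = run_reducer(mapper_outputs, lambda key, values: (key, sum(values)))
print(word_counts)
```

Without the sort, the reducer would have to buffer every value for every key before it could make a single reduce() call; with it, a key change in the stream is enough to know that a group is complete.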

Partitioning, which you mentioned in one of the answers, is a different process. It determines which reducer each (key, value) pair output by the map phase is sent to. The default Partitioner hashes the keys to distribute them among the reduce tasks, but you can override it and use your own custom Partitioner.
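A rough Python sketch of the idea, under stated assumptions: `hash_partition`, `first_letter_partition` and `partition_map_output` are hypothetical names, and `zlib.crc32` is only a deterministic stand-in for whatever hash the framework actually applies to keys.

```python
import zlib


def hash_partition(key, num_reducers):
    """Pick a reducer index from a hash of the key (same key, same reducer)."""
    # zlib.crc32 is used only because it is deterministic across runs;
    # it is a stand-in, not the hash function Hadoop uses.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def first_letter_partition(key, num_reducers):
    """Hypothetical custom partitioner: route keys by their first letter."""
    return (ord(key[0].lower()) - ord("a")) % num_reducers


def partition_map_output(pairs, num_reducers, partitioner=hash_partition):
    """Split one mapper's (key, value) output into one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partitioner(key, num_reducers)].append((key, value))
    return buckets


map_output = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]
custom_buckets = partition_map_output(map_output, 2, first_letter_partition)
print(custom_buckets)
```

Whatever function is used, the property that matters is that every occurrence of a key lands in the same bucket, so a single reducer sees all of that key's values.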

A great source of information for these steps is this Yahoo tutorial (archived).

A nice graphical representation of this is the following (shuffle is called "copy" in this figure):

Note that shuffling and sorting are not performed at all if you specify zero reducers (setNumReduceTasks(0)). In that case the MapReduce job stops at the map phase, and the map phase does not include any kind of sorting (so even the map phase itself is faster).
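The difference between a map-only job and a job with reducers can be mimicked in plain Python. This is a simplified sketch, not Hadoop's implementation: `run_job` and `split_words` are hypothetical, and partitioning across several reducers is ignored, so any positive `num_reducers` behaves like a single reducer.

```python
from itertools import groupby
from operator import itemgetter


def split_words(line):
    """Toy map function: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


def run_job(records, map_fn, reduce_fn, num_reducers):
    """Run a toy job; with zero reducers the map output is returned untouched."""
    map_output = [pair for record in records for pair in map_fn(record)]
    if num_reducers == 0:
        # Map-only job: no shuffle and no sort, output stays in map order.
        return map_output
    # Simplification: treat any positive reducer count as a single reducer.
    sorted_pairs = sorted(map_output, key=itemgetter(0))
    return [
        reduce_fn(key, [value for _, value in pairs])
        for key, pairs in groupby(sorted_pairs, key=itemgetter(0))
    ]


lines = ["b a", "a c"]
map_only = run_job(lines, split_words, None, num_reducers=0)
reduced = run_job(lines, split_words, lambda k, vs: (k, sum(vs)), num_reducers=1)
print(map_only)
print(reduced)
```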

UPDATE: Since you are looking for something more official, you can also read Tom White's book "Hadoop: The Definitive Guide". Here is the interesting part for your question.
Tom White has been an Apache Hadoop committer since February 2007, and is a member of the Apache Software Foundation, so I guess it is pretty credible and official...
