Which runs first, Combiner or Partitioner, in a MapReduce job?


Question

I am confused because I have found two conflicting answers.

1) Hadoop: The Definitive Guide, 3rd edition, Chapter 6, "The Map Side", says: "Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort."

2) The Yahoo developers tutorial (Yahoo tutorial) says the Combiner runs prior to the Partitioner.

Can anyone please clarify which runs first?

Recommended answer

A MapReduce job may contain one or all of these phases:

  1. Map
  2. Combine
  3. Shuffle and Sort
  4. Reduce

The Partitioner fits between the second and third phases.

You can visit this link for more details.

After going through related SE questions and articles,

Which runs first: Partitioner or Combiner?

Who will get a chance to execute first, Combiner or Partitioner?

https://sreejithrpillai.wordpress.com/2014/11/24/implementing-partitioners-and-combiners-for-mapreduce/

we can see that opinion is divided.

But logically, I feel that:

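  1. The mapper writes its output to a circular buffer in memory.
  2. If the number of reducers is greater than 1 and a Partitioner is in place, the mapper output is partitioned.
  3. Once the memory buffer is full, the output is spilled to disk.
  4. As per Hadoop: The Definitive Guide: "Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort."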

This implies that the Partitioner should run first, and the Combiner has to run on the output data within each partition.
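The ordering described in the quoted passage (partition, then sort within each partition, then combine) can be sketched as a small simulation. This is only an illustration in plain Python, not Hadoop code: `partition`, `combine`, and `spill` are hypothetical helper names, the toy partitioner is an assumption chosen to be deterministic, and the combiner assumes a word-count style sum.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def partition(key, num_reducers):
    # Toy partitioner (an assumption for this sketch, not Hadoop's):
    # sum of the key's code points modulo the reducer count.
    return sum(map(ord, key)) % num_reducers


def combine(key, values):
    # Word-count style combiner: collapse the values for one key into a sum.
    return key, sum(values)


def spill(buffer, num_reducers):
    """Simulate one spill of the map-side buffer."""
    # Step 1: assign every record to the partition of its target reducer.
    partitions = defaultdict(list)
    for key, value in buffer:
        partitions[partition(key, num_reducers)].append((key, value))

    spilled = {}
    for part_id, records in partitions.items():
        # Step 2: sort by key within the partition.
        records.sort(key=itemgetter(0))
        # Step 3: run the combiner on the sorted output of that partition.
        spilled[part_id] = [
            combine(key, [v for _, v in group])
            for key, group in groupby(records, key=itemgetter(0))
        ]
    return spilled


map_output = [("b", 1), ("a", 1), ("b", 1), ("c", 1), ("a", 1)]
result = spill(map_output, num_reducers=2)
print(result)
```

In this sketch the combiner only ever sees records that already belong to a single partition, which is the sense in which partitioning comes first.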
