Where does a combiner combine mapper outputs - in the map phase or the reduce phase of a MapReduce job?


Question


I was under the impression that a combiner is just like a reducer that acts on the local map task: it aggregates the results of an individual map task in order to reduce the network bandwidth used for output transfer.

And from reading Hadoop: The Definitive Guide, 3rd edition, my understanding seems correct.

From chapter 2 (page 34)

Combiner Functions Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output—the combiner function’s output forms the input to the reduce function. Since the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all. In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer.
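The zero/one/many-times guarantee in the quote works because wordcount's reduce (per-key summation) is associative and commutative, so the same function can serve as the combiner. A minimal Python sketch that simulates these semantics outside Hadoop (the function names are mine, not Hadoop API):

```python
from collections import defaultdict

def map_phase(lines):
    """Emit (word, 1) pairs, like the wordcount mapper."""
    return [(word, 1) for line in lines for word in line.split()]

def combine(pairs):
    """Sum counts per key; the same logic as the reducer."""
    sums = defaultdict(int)
    for word, count in pairs:
        sums[word] += count
    return list(sums.items())

def reduce_phase(pairs):
    """Final reduce: same summation, returned as a dict."""
    return dict(combine(pairs))

lines = ["hello world hello", "world wordcount"]
pairs = map_phase(lines)

# Applying the combiner zero, one, or many times before the reduce
# must produce identical final output.
assert (reduce_phase(pairs)
        == reduce_phase(combine(pairs))
        == reduce_phase(combine(combine(pairs))))
```

A combiner whose logic is not associative and commutative (e.g. computing a mean of values) would break this invariant, which is why it cannot simply reuse every reducer.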

So I tried the following on the wordcount problem:

job.setMapperClass(mapperClass);
job.setCombinerClass(reduceClass);
job.setNumReduceTasks(0);

Here are the counters:

14/07/18 10:40:15 INFO mapred.JobClient: Counters: 10
14/07/18 10:40:15 INFO mapred.JobClient:   File System Counters
14/07/18 10:40:15 INFO mapred.JobClient:     FILE: Number of bytes read=293
14/07/18 10:40:15 INFO mapred.JobClient:     FILE: Number of bytes written=75964
14/07/18 10:40:15 INFO mapred.JobClient:     FILE: Number of read operations=0
14/07/18 10:40:15 INFO mapred.JobClient:     FILE: Number of large read operations=0
14/07/18 10:40:15 INFO mapred.JobClient:     FILE: Number of write operations=0
14/07/18 10:40:15 INFO mapred.JobClient:   Map-Reduce Framework
14/07/18 10:40:15 INFO mapred.JobClient:     Map input records=7
14/07/18 10:40:15 INFO mapred.JobClient:     Map output records=16
14/07/18 10:40:15 INFO mapred.JobClient:     Input split bytes=125
14/07/18 10:40:15 INFO mapred.JobClient:     Spilled Records=0
14/07/18 10:40:15 INFO mapred.JobClient:     Total committed heap usage (bytes)=85000192

and here is part-m-00000:

hello   1
world   1
Hadoop  1
programming 1
mapreduce   1
wordcount   1
lets    1
see 1
if  1
this    1
works   1
12345678    1
hello   1
world   1
mapreduce   1
wordcount   1

So clearly no combiner was applied. I understand that Hadoop does not guarantee whether a combiner will be called at all. But when I turn on the reduce phase, the combiner gets called.

WHY IS THIS THE BEHAVIOR?

Now when I read chapter 6 (page 208) on how MapReduce works, I see this paragraph in the description of the reduce side.

The map outputs are copied to the reduce task JVM’s memory if they are small enough (the buffer’s size is controlled by mapred.job.shuffle.input.buffer.percent, which specifies the proportion of the heap to use for this purpose); otherwise, they are copied to disk. When the in-memory buffer reaches a threshold size (controlled by mapred.job.shuffle.merge.percent), or reaches a threshold number of map outputs (mapred.inmem.merge.threshold), it is merged and spilled to disk. If a combiner is specified it will be run during the merge to reduce the amount of data written to disk.

My inference from this paragraph is: 1) The combiner is ALSO run during the reduce phase.
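The reduce-side merge described in the quoted paragraph can be sketched roughly as follows (a simulation of the semantics only, not the actual Hadoop implementation; the function and variable names are mine):

```python
from collections import defaultdict

def merge_with_combiner(buffered_outputs):
    """Run the combiner (per-key sum) while merging buffered map
    outputs, so that less data is spilled to disk."""
    merged = defaultdict(int)
    for pairs in buffered_outputs:
        for word, count in pairs:
            merged[word] += count
    return dict(merged)

# Three map outputs arrive at the reduce task; suppose they exceed
# the in-memory threshold, so they are merged and spilled. The
# combiner runs during this merge.
map_outputs = [
    [("hello", 1), ("world", 1)],
    [("hello", 1)],
    [("world", 1), ("wordcount", 1)],
]
spilled = merge_with_combiner(map_outputs)
# 5 input records are combined into 3 before hitting disk.
print(spilled)
```

In real Hadoop the thresholds are the configuration properties quoted above (mapred.job.shuffle.merge.percent and mapred.inmem.merge.threshold); the sketch only shows why running the combiner at this point shrinks the spill.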

Solution

The main function of a combiner is optimization. In most cases it acts like a mini-reducer. From page 206 of the same book, in the chapter "How MapReduce Works" (The Map Side):

Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer.

The quote from your question,

If a combiner is specified it will be run during the merge to reduce the amount of data written to disk.

Both the quotes indicate that a combiner is run primarily for compactness. Reducing the network bandwidth for output transfer is an advantage of this optimization.

Also, from the same book,

Recall that combiners may be run repeatedly over the input without affecting the final result. If there are only one or two spills, then the potential reduction in map output size is not worth the overhead in invoking the combiner, so it is not run again for this map output.

This means that Hadoop does not guarantee how many times a combiner is run (it could also be zero).

A combiner is never run for a map-only job. This makes sense because a combiner changes the map output, and since Hadoop does not guarantee the number of times the combiner is called, the map output would not be guaranteed to be the same either.
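To see why running a combiner in a map-only job would be unsafe, compare the two outputs (again a simulation of the semantics, not Hadoop itself): in a map-only job the map output IS the final job output, so combining would visibly change what the job produces, and with no guarantee on the invocation count, the output would not even be deterministic.

```python
from collections import defaultdict

def map_phase(lines):
    """Wordcount mapper: emit (word, 1) per word."""
    return [(word, 1) for line in lines for word in line.split()]

def combine(pairs):
    """Per-key sum, as a combiner would do."""
    sums = defaultdict(int)
    for word, count in pairs:
        sums[word] += count
    return list(sums.items())

lines = ["hello world", "hello wordcount"]

map_only_output = map_phase(lines)          # the final output of a map-only job
combined_output = combine(map_only_output)  # what a combiner would emit instead

# The combiner changes the records themselves: 4 records become 3.
# In a map-only job that difference would show up directly in the
# part-m-* files, which is why Hadoop never runs it there.
print(len(map_only_output), len(combined_output))
```

This matches the counters in the question: with setNumReduceTasks(0), all 16 map output records appear unchanged in part-m-00000 even though a combiner class was set.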

