Chaining Multi-Reducers in a Hadoop MapReduce job

This article discusses how to chain multiple reducers in a Hadoop MapReduce job; it may be a useful reference for anyone facing the same problem.

Problem Description



Now I have a 4-phase MapReduce job as follows:

Input-> Map1 -> Reduce1 -> Reducer2 -> Reduce3 -> Reduce4 -> Output

I notice that there is a ChainMapper class in Hadoop that can chain several mappers into one big mapper and save the disk I/O cost between map phases. There is also a ChainReducer class; however, it is not a real "chain reducer". It can only support jobs of the form:

[MAP+ / REDUCE MAP*]

I know I can set up four MR jobs for my task and use default mappers for the last three jobs. But that costs a lot of disk I/O, since each reducer must write its result to disk so that the following mapper can read it. Is there any other Hadoop built-in feature to chain my reducers and lower the I/O cost?

I am using Hadoop 1.0.4.

Solution

I don't think the output of a reducer can be fed directly to another reducer. I would go for this:

Input-> Map1 -> Reduce1 -> 
        Identity mapper -> Reducer2 -> 
                Identity mapper -> Reduce3 -> 
                         Identity mapper -> Reduce4 -> Output
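A minimal driver sketch of the first two phases of this pipeline, assuming the Hadoop 1.x `new Job(conf, …)` constructor; the `Map1`, `Reduce1`, and `Reduce2` classes, the `Text` key/value types, and the intermediate path are placeholders you would supply. In the new API, the base `Mapper` class passes records through unchanged, so it serves as the identity mapper:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Phase 1: Map1 -> Reduce1, writing to an intermediate HDFS dir.
        Job job1 = new Job(conf, "phase-1");
        job1.setJarByClass(ChainDriver.class);
        job1.setMapperClass(Map1.class);      // hypothetical mapper class
        job1.setReducerClass(Reduce1.class);  // hypothetical reducer class
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job1, new Path(args[0]));
        FileOutputFormat.setOutputPath(job1, new Path("/tmp/phase1"));
        if (!job1.waitForCompletion(true)) System.exit(1);

        // Phase 2: identity mapper -> Reduce2, reading phase 1's output.
        Job job2 = new Job(conf, "phase-2");
        job2.setJarByClass(ChainDriver.class);
        job2.setMapperClass(Mapper.class);    // base Mapper = identity mapper
        job2.setReducerClass(Reduce2.class);  // hypothetical reducer class
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job2, new Path("/tmp/phase1"));
        FileOutputFormat.setOutputPath(job2, new Path(args[1]));
        if (!job2.waitForCompletion(true)) System.exit(1);

        // Phases 3 and 4 repeat the same identity-mapper pattern.
    }
}
```

Each phase's output directory becomes the next phase's input directory, which is exactly where the disk I/O cost the question mentions comes from.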

In the Hadoop 2.x series, within a single job you can chain mappers before the reducer with ChainMapper and chain mappers after the reducer with ChainReducer.
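For illustration, a sketch of the new-API chain classes in `org.apache.hadoop.mapreduce.lib.chain`; the `AMap`, `BMap`, `MyReduce`, and `CMap` classes and the `Text`-everywhere key/value types are placeholder assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;

public class ChainJobDriver {
    // Configures one job of the form [MAP+ / REDUCE MAP*].
    public static Job buildChainJob(Configuration conf) throws Exception {
        Job job = new Job(conf, "chain-job");
        job.setJarByClass(ChainJobDriver.class);

        // Mappers that run before the single reduce phase:
        ChainMapper.addMapper(job, AMap.class,
                LongWritable.class, Text.class, Text.class, Text.class,
                new Configuration(false));
        ChainMapper.addMapper(job, BMap.class,
                Text.class, Text.class, Text.class, Text.class,
                new Configuration(false));

        // The one reducer allowed per job:
        ChainReducer.setReducer(job, MyReduce.class,
                Text.class, Text.class, Text.class, Text.class,
                new Configuration(false));

        // A mapper that runs after the reducer, inside the reduce task,
        // so its input never goes through another shuffle or disk write:
        ChainReducer.addMapper(job, CMap.class,
                Text.class, Text.class, Text.class, Text.class,
                new Configuration(false));

        return job;
    }
}
```

Note that even with ChainReducer, only one true reduce phase runs per job; chaining several reducers still requires separate jobs, which is why the identity-mapper pipeline shown earlier is needed.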

This concludes the article on chaining multi-reducers in a Hadoop MapReduce job; we hope the answer above is helpful.
