Chaining Multi-Reducers in a Hadoop MapReduce job
Problem Description
Now I have a MapReduce job with one map phase followed by four reduce phases:
Input -> Map1 -> Reduce1 -> Reduce2 -> Reduce3 -> Reduce4 -> Output
I notice that there is a ChainMapper class in Hadoop which can chain several mappers into one big mapper and save the disk I/O cost between map phases. There is also a ChainReducer class, but it is not a real "chain reducer": it only supports jobs of the form

[MAP+ / REDUCE MAP*]

that is, one or more mappers, followed by a single reducer, optionally followed by more mappers.
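For reference, here is a minimal driver sketch of that pattern under the old mapred API shipped with Hadoop 1.0.4; the Map1, Reduce1, and Map2 classes are hypothetical placeholders for your own implementations:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

public class ChainJobDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ChainJobDriver.class);
        conf.setJobName("chain");
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // MAP+: one or more mappers run back to back in the map task,
        // passing records in memory instead of through HDFS.
        ChainMapper.addMapper(conf, Map1.class,            // hypothetical mapper
                LongWritable.class, Text.class, Text.class, Text.class,
                true, new JobConf(false));

        // REDUCE: exactly one reducer is allowed in the chain.
        ChainReducer.setReducer(conf, Reduce1.class,       // hypothetical reducer
                Text.class, Text.class, Text.class, Text.class,
                true, new JobConf(false));

        // MAP*: zero or more mappers may follow the reducer,
        // but a second reducer cannot be chained here.
        ChainReducer.addMapper(conf, Map2.class,           // hypothetical mapper
                Text.class, Text.class, Text.class, Text.class,
                true, new JobConf(false));

        JobClient.runJob(conf);
    }
}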
I know I can set up four MR jobs for my task and use default mappers for the last three jobs. But that will cost a lot of disk I/O, since each reducer has to write its result to disk so that the following mapper can read it. Is there any other built-in Hadoop feature for chaining my reducers that lowers the I/O cost?
I am using Hadoop 1.0.4.
I don't think you can feed the output of one reducer directly into another reducer. I would go with the following (a driver sketch appears after the diagram):
Input -> Map1 -> Reduce1 ->
Identity mapper -> Reduce2 ->
Identity mapper -> Reduce3 ->
Identity mapper -> Reduce4 -> Output
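A minimal sketch of that chain of jobs under the Hadoop 1.x mapred API (Map1 and Reduce1..Reduce4 stand for your own classes; the tmp/step* paths are illustrative). SequenceFile output lets each job read its predecessor's output directly:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class ReducerChainDriver {
    public static void main(String[] args) throws Exception {
        // Job 1: Map1 -> Reduce1, writing to an intermediate directory.
        JobConf job1 = new JobConf(ReducerChainDriver.class);
        job1.setJobName("step1");
        job1.setMapperClass(Map1.class);      // hypothetical user mapper
        job1.setReducerClass(Reduce1.class);  // hypothetical user reducer
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(Text.class);
        job1.setOutputFormat(SequenceFileOutputFormat.class);
        FileInputFormat.setInputPaths(job1, new Path(args[0]));
        FileOutputFormat.setOutputPath(job1, new Path("tmp/step1"));
        JobClient.runJob(job1);               // blocks until job 1 completes

        // Job 2: IdentityMapper -> Reduce2, reading job 1's output from
        // HDFS. This round trip through disk is exactly the I/O cost the
        // question asks about; jobs 3 and 4 repeat the same pattern with
        // Reduce3 and Reduce4.
        JobConf job2 = new JobConf(ReducerChainDriver.class);
        job2.setJobName("step2");
        job2.setMapperClass(IdentityMapper.class);
        job2.setReducerClass(Reduce2.class);  // hypothetical user reducer
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(Text.class);
        job2.setInputFormat(SequenceFileInputFormat.class);
        job2.setOutputFormat(SequenceFileOutputFormat.class);
        FileInputFormat.setInputPaths(job2, new Path("tmp/step1"));
        FileOutputFormat.setOutputPath(job2, new Path("tmp/step2"));
        JobClient.runJob(job2);
    }
}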
In the Hadoop 2.x series you can chain mappers before the reducer with ChainMapper and chain mappers after the reducer with ChainReducer, all within a single job; reducers themselves still cannot be chained directly.
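A minimal sketch of the same [MAP+ / REDUCE MAP*] pattern with the Hadoop 2.x mapreduce API (again, Map1, Reduce1, and Map2 are hypothetical placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Chain2xDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "chain-2x");
        job.setJarByClass(Chain2xDriver.class);

        // Mappers before the single reducer (the MAP+ part).
        ChainMapper.addMapper(job, Map1.class,             // hypothetical mapper
                LongWritable.class, Text.class, Text.class, Text.class,
                new Configuration(false));

        // The single reducer; only one is allowed per job.
        ChainReducer.setReducer(job, Reduce1.class,        // hypothetical reducer
                Text.class, Text.class, Text.class, Text.class,
                new Configuration(false));

        // Mappers after the reducer (the MAP* part).
        ChainReducer.addMapper(job, Map2.class,            // hypothetical mapper
                Text.class, Text.class, Text.class, Text.class,
                new Configuration(false));

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}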