On what basis does the MapReduce framework decide whether to launch a combiner?
By definition, "The Combiner may be called 0, 1, or many times on each key between the mapper and reducer."
I want to know on what basis the MapReduce framework decides how many times the combiner will be launched.
In short, it depends on the number of spills to disk. Once the MapOutputBuffer fills up, its contents are sorted and spilled, and the combiner runs as part of that same step.
You can tune the number of spills to disk with the parameters io.sort.mb, io.sort.spill.percent, and io.sort.record.percent. These are also explained in the documentation (books and online resources).
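To make the effect of these parameters concrete, here is a rough estimate of the spill count for a single map task. It is a simplified model under stated assumptions, not Hadoop's actual accounting: it assumes a spill is triggered each time io.sort.mb * io.sort.spill.percent bytes of output accumulate, it ignores the record metadata space governed by io.sort.record.percent, and it counts the final flush as a spill. The class and method names are hypothetical.

```java
public class SpillEstimate {

    /**
     * Rough spill estimate for one map task (simplified model, see assumptions above).
     *
     * @param mapOutputBytes serialized map output size in bytes
     * @param ioSortMb       value of io.sort.mb (buffer size in MB)
     * @param spillPercent   value of io.sort.spill.percent (e.g. 0.80)
     */
    public static long estimateSpills(long mapOutputBytes, int ioSortMb, double spillPercent) {
        // Bytes that may accumulate before a spill is triggered in this model.
        long thresholdBytes = Math.round(ioSortMb * 1024.0 * 1024.0 * spillPercent);
        // Ceiling division: every full threshold is one spill, the remainder is the final flush.
        long spills = (mapOutputBytes + thresholdBytes - 1) / thresholdBytes;
        // The buffer is flushed at the end of the map stage, so count at least one spill.
        return Math.max(1, spills);
    }

    public static void main(String[] args) {
        long output = 500L * 1024 * 1024; // 500 MB of map output (made-up figure)
        System.out.println(estimateSpills(output, 100, 0.80)); // prints 7
        System.out.println(estimateSpills(output, 200, 0.80)); // prints 4
    }
}
```

In this model, doubling io.sort.mb roughly halves the number of spills, and with it the number of times a combiner would run during spilling.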
Examples for specific numbers of combiner runs:
0 -> no combiner was defined
1 -> a combiner was defined and the MapOutputBuffer was spilled once
>1 -> a combiner was defined and the MapOutputBuffer was spilled more than once
Note that even if the MapOutputBuffer never fills up completely, it must be flushed at the end of the map stage, which triggers the combiner to run at least once (if one is defined).
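The rule described above can be sketched as a toy simulation. This is not Hadoop code: the buffer capacity is counted in records rather than bytes, the "combiner" simply collapses duplicate keys as in a word count, and one combiner pass is counted per spill. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class SpillSimulation {
    private final int capacity;            // records buffered before a spill (stand-in for the byte threshold)
    private final boolean combinerDefined;
    private final List<String> buffer = new ArrayList<>();
    private int spillCount = 0;
    private int combinerPasses = 0;
    private int spilledRecords = 0;

    public SpillSimulation(int capacity, boolean combinerDefined) {
        this.capacity = capacity;
        this.combinerDefined = combinerDefined;
    }

    /** Called once per map output record; spills when the buffer is full. */
    public void collect(String key) {
        buffer.add(key);
        if (buffer.size() >= capacity) {
            spill();
        }
    }

    /** End of the map stage: flush what is left, or spill once if nothing was spilled yet. */
    public void close() {
        if (!buffer.isEmpty() || spillCount == 0) {
            spill();
        }
    }

    private void spill() {
        spillCount++;
        if (combinerDefined) {
            // Sort and combine: one output record per distinct key in this spill.
            TreeMap<String, Integer> combined = new TreeMap<>();
            for (String key : buffer) {
                combined.merge(key, 1, Integer::sum);
            }
            combinerPasses++;
            spilledRecords += combined.size();
        } else {
            // No combiner: every buffered record is written out as-is.
            spilledRecords += buffer.size();
        }
        buffer.clear();
    }

    public int getSpillCount() { return spillCount; }
    public int getCombinerPasses() { return combinerPasses; }
    public int getSpilledRecords() { return spilledRecords; }

    /** Feeds all keys through a fresh simulation and closes it. */
    public static SpillSimulation run(String[] keys, int capacity, boolean combinerDefined) {
        SpillSimulation sim = new SpillSimulation(capacity, combinerDefined);
        for (String key : keys) {
            sim.collect(key);
        }
        sim.close();
        return sim;
    }

    public static void main(String[] args) {
        String[] words = {"a", "b", "a", "c", "b", "a", "d", "a"};
        System.out.println(run(words, 3, true).getCombinerPasses());   // prints 3 (buffer spilled three times)
        System.out.println(run(words, 100, true).getCombinerPasses()); // prints 1 (only the final flush)
        System.out.println(run(words, 3, false).getCombinerPasses());  // prints 0 (no combiner defined)
    }
}
```

With a capacity of 3 the eight records are spilled three times, so the combiner runs three times and writes 7 records instead of 8. With a large buffer it runs once, on the final flush, and writes 4.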