When do reduce tasks start in Hadoop?

Question

In Hadoop, when do reduce tasks start? Do they start after a certain percentage (threshold) of mappers complete? If so, is this threshold fixed? What kind of threshold is typically used?

Answer

The reduce phase has 3 steps: shuffle, sort, reduce. Shuffle is where the data is collected by the reducer from each mapper. This can happen while mappers are still generating data, since it is only a data transfer. On the other hand, sort and reduce can only start once all the mappers are done. You can tell which step MapReduce is in by looking at the reducer completion percentage: 0-33% means it's doing the shuffle, 34-66% is sort, and 67-100% is reduce. This is why your reducers will sometimes seem "stuck" at 33%: they're waiting for the mappers to finish.

Reducers start shuffling once a threshold percentage of mappers has finished. You can change this parameter to make the reducers start sooner or later.

Why is starting the reducers early a good thing? Because it spreads the data transfer from the mappers to the reducers out over time, which helps if your network is the bottleneck.

Why is starting the reducers early a bad thing? Because they "hog" reduce slots while doing nothing but copying data and waiting for mappers to finish. Another job that starts later and would actually use those slots then can't get them.

You can customize when the reducers start up by changing the default value of mapred.reduce.slowstart.completed.maps in mapred-site.xml. A value of 1.00 waits for all the mappers to finish before starting the reducers. A value of 0.0 starts the reducers right away. A value of 0.5 starts the reducers when half of the mappers are complete. You can also change mapred.reduce.slowstart.completed.maps on a job-by-job basis. In newer versions of Hadoop (at least 2.4.1) the parameter is called mapreduce.job.reduce.slowstart.completedmaps (thanks to user yegor256).
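For example, a minimal mapred-site.xml entry might look like the sketch below. It uses the newer mapreduce.job.reduce.slowstart.completedmaps name; the 0.80 value is only an illustration, not a recommendation from this answer.

    <!-- mapred-site.xml: reducers begin their shuffle once 80% of map tasks finish -->
    <property>
      <name>mapreduce.job.reduce.slowstart.completedmaps</name>
      <value>0.80</value>
    </property>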

Typically, I like to keep mapred.reduce.slowstart.completed.maps above 0.9 if the system ever has multiple jobs running at once. That way a job doesn't hog reducers while they aren't doing anything but copying data. If you only ever have one job running at a time, 0.1 would probably be appropriate.
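If you submit jobs through the Java MapReduce API, the per-job override mentioned above can be set on the job's Configuration before submission. A minimal sketch, assuming Hadoop 2.x class names and the 0.9 value suggested for shared clusters:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SlowstartExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Per-job override: reducers wait until 90% of map tasks are
            // complete before their shuffle phase begins.
            conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.90f);

            Job job = Job.getInstance(conf, "slowstart-example");
            // ... configure mapper, reducer, input and output paths as usual,
            // then submit with job.waitForCompletion(true).
        }
    }

If the driver implements Hadoop's Tool interface and runs through ToolRunner, the same override can also be passed on the command line with -D mapreduce.job.reduce.slowstart.completedmaps=0.9, with no recompilation needed.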
