When do reduce tasks start in Hadoop?

Question

In Hadoop when do reduce tasks start? Do they start after a certain percentage (threshold) of mappers complete? If so, is this threshold fixed? What kind of threshold is typically used?

Solution

The reduce phase has 3 steps: shuffle, sort, reduce. Shuffle is where the data is collected by the reducer from each mapper. This can happen while mappers are generating data, since it is only a data transfer. On the other hand, sort and reduce can only start once all the mappers are done. You can tell which one MapReduce is doing by looking at the reducer completion percentage: 0-33% means it's doing the shuffle, 34-66% is sort, and 67-100% is reduce. This is why your reducers will sometimes seem "stuck" at 33%: they're waiting for the mappers to finish.

Reducers start shuffling once a threshold percentage of mappers has finished. You can change this parameter to make the reducers start sooner or later.

Why is starting the reducers early a good thing? Because it spreads out the data transfer from the mappers to the reducers over time, which is a good thing if your network is the bottleneck.

Why is starting the reducers early a bad thing? Because they "hog up" reduce slots while only copying data and waiting for mappers to finish. Another job that starts later and would actually use those reduce slots now can't use them.

You can customize when the reducers start up by changing the default value of mapred.reduce.slowstart.completed.maps in mapred-site.xml. A value of 1.00 will wait for all the mappers to finish before starting the reducers. A value of 0.0 will start the reducers right away. A value of 0.5 will start the reducers when half of the mappers are complete. You can also change mapred.reduce.slowstart.completed.maps on a job-by-job basis. In new versions of Hadoop (at least 2.4.1) the parameter is called mapreduce.job.reduce.slowstart.completedmaps (thanks user yegor256). A cluster-wide example is sketched below.
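For illustration, a minimal mapred-site.xml entry might look like this (the 0.80 value is just an example; on Hadoop 2.4.1+ use the mapreduce.job.reduce.slowstart.completedmaps name instead):

```xml
<!-- mapred-site.xml: hold reducers back until 80% of map tasks have finished -->
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.80</value>
</property>
```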

Typically, I like to keep mapred.reduce.slowstart.completed.maps above 0.9 if the system ever has multiple jobs running at once. This way the job doesn't hog up reducers when they aren't doing anything but copying data. If you only ever have one job running at a time, doing 0.1 would probably be appropriate.
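For the job-by-job case, here is a minimal driver sketch, assuming Hadoop 2.x (the class name and the 0.90 value are illustrative; with no mapper/reducer classes set, Hadoop runs its identity map and reduce, which is enough to observe when reduce tasks get scheduled):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SlowstartDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Per-job override: don't launch reducers until 90% of map tasks are done.
        // (Hadoop 2.x name; older clusters use mapred.reduce.slowstart.completed.maps.)
        conf.set("mapreduce.job.reduce.slowstart.completedmaps", "0.90");

        Job job = Job.getInstance(conf, "slowstart-demo");
        job.setJarByClass(SlowstartDriver.class);
        job.setNumReduceTasks(1);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The same override also works from the command line, e.g. -D mapreduce.job.reduce.slowstart.completedmaps=0.9, provided the job's driver goes through ToolRunner/GenericOptionsParser.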
