How hadoop decides how many nodes will do map and reduce tasks


Question




I'm new to hadoop and I'm trying to understand it. I'm talking about hadoop 2. When I have an input file on which I want to do a MapReduce job, in the MapReduce program I specify the parameters of the split, so it will make as many map tasks as there are splits, right?

The resource manager knows where the files are and will send the tasks to the nodes that have the data, but who decides how many nodes will do the tasks? After the maps are done there is the shuffle; which node will do a given reduce task is decided by the partitioner, which computes a hash, right? How many nodes will do reduce tasks? Will nodes that have done map tasks also do reduce tasks?

Thank you.

TLDR: If I have a cluster and I run a MapReduce job, how does Hadoop decide how many nodes will do map tasks, and then which nodes will do the reduce tasks?

Solution

How Many Maps?

The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.

The right level of parallelism for maps seems to be around 10-100 maps per node, although it has been set up to 300 maps for very CPU-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.

If you have 10TB of input data and a block size of 128MB, you'll end up with 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.
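The arithmetic above can be sketched as plain Java, assuming one map task per HDFS block of the input (the default behavior for splittable files):

```java
public class MapCount {
    // One map task per HDFS block: ceiling division of input size by block size.
    public static long numMaps(long inputBytes, long blockBytes) {
        return (inputBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long tenTB = 10L * 1024 * 1024 * 1024 * 1024; // 10 TB of input
        long block = 128L * 1024 * 1024;              // 128 MB block size
        System.out.println(numMaps(tenTB, block));    // 81920, i.e. the ~82,000 maps quoted above
    }
}
```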

How Many Reduces?

The right number of reduces seems to be 0.95 or 1.75 multiplied by ( < no. of nodes > * < no. of maximum containers per node > ).

With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing.
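The two factors can be plugged into the formula directly; here is a small sketch with made-up cluster numbers (10 nodes, 8 containers each) purely for illustration:

```java
public class ReduceCount {
    // reduces = factor * (nodes * max containers per node), truncated to an int.
    public static int numReduces(double factor, int nodes, int maxContainersPerNode) {
        return (int) (factor * nodes * maxContainersPerNode);
    }

    public static void main(String[] args) {
        int nodes = 10, containers = 8;                          // hypothetical cluster
        System.out.println(numReduces(0.95, nodes, containers)); // 76: one wave, all at once
        System.out.println(numReduces(1.75, nodes, containers)); // 140: two waves, better balancing
    }
}
```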

Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures.

Reducer NONE

It is legal to set the number of reduce tasks to zero if no reduction is desired.
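A map-only job can be requested through the job configuration, for example (the programmatic equivalent is `job.setNumReduceTasks(0)` on `org.apache.hadoop.mapreduce.Job`):

```xml
<!-- Job configuration: run with zero reducers; map output goes straight to the output path -->
<property>
  <name>mapreduce.job.reduces</name>
  <value>0</value>
</property>
```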

Which nodes for Reduce tasks?

You can configure the number of mappers and reducers per node via configuration parameters such as mapreduce.tasktracker.reduce.tasks.maximum

If you set this parameter to zero, that node won't be considered for Reduce tasks. Otherwise, all nodes in the cluster are eligible for Reduce tasks.
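For example, excluding a single node from reduce work would look like this in that node's mapred-site.xml (using the property quoted above):

```xml
<!-- On the node to exclude: this node will run no reduce tasks -->
<property>
  <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
  <value>0</value>
</property>
```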

Source: Map Reduce Tutorial from Apache.

Note: For a given job, you can set mapreduce.job.maps & mapreduce.job.reduces, but they may not be effective (the maps setting is only a hint). It is usually best to let the MapReduce framework decide the number of Map & Reduce tasks.

EDIT:

How to decide which Reducer node?

Assume that you have equal reduce slots available on two nodes N1 and N2, and the current load on N1 > N2; then the reduce task will be assigned to N2. If both load and number of slots are the same, whichever node sends the first heartbeat to the resource manager will get the task. This is the code block for reduce assignment: http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/mapred/JobQueueTaskScheduler.java#207
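The tie-break described above can be modeled as a toy in plain Java. The class and field names here are illustrative, not Hadoop's actual scheduler types: pick the node with free slots and the lowest load, and on a load tie, the node whose heartbeat arrived first:

```java
import java.util.Comparator;
import java.util.List;

public class ReducePlacement {
    // Toy stand-in for a cluster node: current load, free reduce slots,
    // and the arrival time of its last heartbeat.
    public record Node(String name, int load, int freeSlots, long heartbeatMillis) {}

    // Lowest load wins; on a tie, the earliest heartbeat wins.
    public static Node pick(List<Node> candidates) {
        return candidates.stream()
            .filter(n -> n.freeSlots() > 0)
            .min(Comparator.comparingInt(Node::load)
                           .thenComparingLong(Node::heartbeatMillis))
            .orElseThrow();
    }

    public static void main(String[] args) {
        Node n1 = new Node("N1", 5, 2, 100);
        Node n2 = new Node("N2", 3, 2, 200);
        System.out.println(pick(List.of(n1, n2)).name()); // N2: lower load wins
    }
}
```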
