How does Hadoop decide how many nodes will perform the Map and Reduce tasks?


Problem Description

I'm new to Hadoop and I'm trying to understand it. I'm talking about Hadoop 2. When I have an input file on which I want to run a MapReduce job, I specify the split parameter in the MapReduce program, so it will create as many map tasks as there are splits, right?

The resource manager knows where the files are and will send the tasks to the nodes that have the data, but who decides how many nodes will do the tasks? After the maps are done there is the shuffle; which node will do a given reduce task is decided by the partitioner, which hashes the map output keys, right? How many nodes will do reduce tasks? Will the nodes that ran map tasks also run reduce tasks?

Thanks.

TL;DR: If I have a cluster and I run a MapReduce job, how does Hadoop decide how many nodes will do map tasks, and then which nodes will do the reduce tasks?

Recommended Answer

How many maps?

The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.

The right level of parallelism for maps seems to be around 10-100 maps per node, although it has been set as high as 300 maps for very CPU-light map tasks. Task setup takes a while, so it is best if each map takes at least a minute to execute.

If you have 10 TB of input data and a block size of 128 MB, you'll end up with about 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.
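As a minimal driver sketch of that hint (the class name and the map count here are made up for illustration):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.MRJobConfig;

public class MapCountHint {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // MRJobConfig.NUM_MAPS resolves to "mapreduce.job.maps" and is only a
        // hint: the actual number of map tasks still follows the input splits.
        conf.setInt(MRJobConfig.NUM_MAPS, 100000);
        Job job = Job.getInstance(conf, "map-count-hint");
        System.out.println("Map hint: " + job.getConfiguration().get(MRJobConfig.NUM_MAPS));
    }
}
```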

How many reduces?

The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).

With 0.95, all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75, the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing.

Increasing the number of reduces increases the framework overhead, but improves load balancing and lowers the cost of failures.
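As a sketch of how the formula is applied, assuming hypothetical cluster figures (20 nodes, at most 8 containers per node):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReduceCountSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical cluster: 20 nodes, at most 8 containers per node.
        int nodes = 20;
        int maxContainersPerNode = 8;

        // 0.95: all reduces launch in one wave as the maps finish.
        // 1.75: faster nodes get a second wave, improving load balancing.
        int reduces = (int) (0.95 * nodes * maxContainersPerNode); // 152

        Job job = Job.getInstance(new Configuration(), "reduce-count-sketch");
        job.setNumReduceTasks(reduces); // unlike the map count, this value is honored exactly
    }
}
```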

Reducer NONE

It is legal to set the number of reduce tasks to zero if no reduction is desired.
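For example, a map-only job sketch (names are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlySketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only-sketch");
        // Zero reduce tasks: the shuffle/sort phase is skipped and each map's
        // output is written directly to the job's output path.
        job.setNumReduceTasks(0);
    }
}
```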

哪些节点可用于简化任务?

您可以根据如果将此参数设置为零,则减少任务将不考虑该节点.否则,群集中的所有节点都可以执行减少任务.

if you set this parameter as zero, that node won't be considered for Reduce tasks. Otherwise, all nodes in the cluster are eligible for Reduce tasks.

Source: Apache's Map Reduce Tutorial.

Note: For a given job, you can set mapreduce.job.maps and mapreduce.job.reduces, but they may not be effective. We should leave the decision on the number of Map and Reduce tasks to the MapReduce framework.
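If you do want to pass these per job, one common route is the -D generic options parsed by ToolRunner; a small sketch (the class name is hypothetical) that simply echoes what was set:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Run as: hadoop jar myjob.jar ShowJobConf -D mapreduce.job.maps=50 -D mapreduce.job.reduces=10
public class ShowJobConf extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        // mapreduce.job.maps is only a hint; the split count decides the real number.
        System.out.println("mapreduce.job.maps    = " + conf.get("mapreduce.job.maps"));
        // mapreduce.job.reduces is applied as given.
        System.out.println("mapreduce.job.reduces = " + conf.get("mapreduce.job.reduces"));
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new ShowJobConf(), args));
    }
}
```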

How is the Reducer node decided?

Assume that you have equal reduce slots available on two nodes N1 and N2, and the current load on N1 > N2; then the Reduce task will be assigned to N2. If both the load and the number of slots are the same, whichever node sends the first heartbeat to the resource manager will get the task. This is the code block for reduce assignment: http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/mapred/JobQueueTaskScheduler.java#207
