How does Hadoop decide how many nodes will perform the Map and Reduce tasks?

Question

I'm new to Hadoop and I'm trying to understand it. I'm talking about Hadoop 2. When I have an input file on which I want to run a MapReduce job, in the MapReduce program I specify the split parameter, so it will create as many map tasks as there are splits, right?

The resource manager knows where the files are and will send the tasks to the nodes that have the data, but who decides how many nodes will do the tasks? After the maps are done there is the shuffle; which node will do a reduce task is decided by the partitioner, which does a hash of the key, right? How many nodes will do reduce tasks? Will nodes that have done maps also do reduce tasks?

Thanks.

TL;DR: If I have a cluster and I run a MapReduce job, how does Hadoop decide how many nodes will do map tasks, and then which nodes will do the reduce tasks?

Answer

How many maps?

The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.

The right level of parallelism for maps seems to be around 10-100 maps per node, although it has been set as high as 300 maps for very CPU-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.

If you have 10 TB of input data and a block size of 128 MB, you'll end up with about 82,000 maps (10 TB / 128 MB = 81,920 blocks), unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.
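
A minimal sketch of passing that hint through the Job API (the job name and hint value here are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.MRJobConfig;

    public class MapCountHint {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "big-input-job");
            // MRJobConfig.NUM_MAPS is "mapreduce.job.maps": a hint only.
            // The actual map count is still derived from the input splits.
            job.getConfiguration().setInt(MRJobConfig.NUM_MAPS, 100000);
        }
    }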

How many reduces?

The right number of reduces seems to be 0.95 or 1.75 multiplied by (&lt;no. of nodes&gt; * &lt;no. of maximum containers per node&gt;).

With 0.95, all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75, the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing.
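
As a worked example (the cluster shape is assumed): with 10 nodes and 8 containers per node, 0.95 * 80 = 76 reduces fill the cluster in a single wave, while 1.75 * 80 = 140 reduces run in roughly two waves. A sketch of applying the result:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReduceCount {
        public static void main(String[] args) throws Exception {
            // Assumed cluster shape: 10 nodes, 8 containers per node.
            int nodes = 10, containersPerNode = 8;
            int singleWave = (int) (0.95 * nodes * containersPerNode); // 76
            int twoWaves   = (int) (1.75 * nodes * containersPerNode); // 140

            Job job = Job.getInstance(new Configuration(), "reduce-count");
            // Unlike the map-count hint, this setting is honored exactly.
            job.setNumReduceTasks(singleWave);
        }
    }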

Increasing the number of reduces increases the framework overhead, but improves load balancing and lowers the cost of failures.

Reducer NONE

It is legal to set the number of reduce tasks to zero if no reduction is desired.
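
A minimal sketch of a map-only job (with zero reduces, each mapper's output is written directly to the job's output path, and no shuffle or sort occurs):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class MapOnlyJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "map-only");
            // Zero reduces: no shuffle/sort phase; map output goes
            // straight to the output path on the FileSystem.
            job.setNumReduceTasks(0);
        }
    }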

Which nodes run the Reduce tasks?

You can control this with a per-node configuration parameter such as mapreduce.tasktracker.reduce.tasks.maximum. If you set this parameter to zero on a node, that node won't be considered for reduce tasks. Otherwise, all nodes in the cluster are eligible for reduce tasks.
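
A minimal fragment, assuming the setting lives in mapred-site.xml on the node in question (this is the MRv1-era TaskTracker property named above):

    <property>
      <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
      <value>0</value>
    </property>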

Source: the MapReduce Tutorial from Apache.

Note: for a given job you can set mapreduce.job.maps and mapreduce.job.reduces, but they may not be effective; it is generally best to let the MapReduce framework decide the number of map and reduce tasks.

How to decide which Reducer node?

Assume you have equal reduce slots available on two nodes N1 and N2, and the current load on N1 > N2; then the reduce task will be assigned to N2. If both the load and the number of slots are the same, whoever sends the first heartbeat to the resource manager will get the task. This is the code block for reduce assignment: http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/mapred/JobQueueTaskScheduler.java#207
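
On the partitioner half of the question: the partitioner only decides which reduce task (partition number) each key goes to; the scheduler above decides which node runs that task. A sketch of the default hash-based logic (modeled on Hadoop's HashPartitioner):

    import org.apache.hadoop.mapreduce.Partitioner;

    // Maps a key to a reduce-task number in [0, numReduceTasks).
    public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            // Mask the sign bit so the modulo result is non-negative.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }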
