Hadoop: number of available map slots based on cluster size
Question
Reading the syslog generated by Hadoop, I can see lines similar to this one:
2013-05-06 16:32:45,118 INFO org.apache.hadoop.mapred.JobClient (main): Setting default number of map tasks based on cluster size to : 84
Does anyone know how this value is computed? And how can I get this value in my program?
I grepped the source code of Hadoop and did not find the string "Setting default number of map tasks based on cluster size to" at all (whereas I do find other strings that are printed when running MR jobs). Furthermore, this string is not printed anywhere in my local installation. A Google search for it lists problems on AWS with EMR.
As you confirmed, you're in fact using Amazon Elastic MapReduce. I believe EMR has some modifications of its own to the JobClient class of Hadoop, which output this particular line.
As far as computing this number is concerned, I would suspect it is computed from characteristics like the total number of (active) nodes in the cluster (N) and the number of map slots per node (M), i.e. N * M. However, additional AWS-specific resource (memory) constraints may also be taken into account. You'd have to ask in EMR-related forums for the exact formula.
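To make the suspected N * M relationship concrete, here is a minimal sketch. Note this is an assumption about EMR's behavior, not its actual code; the figures (21 nodes with 4 map slots each) are hypothetical values chosen only because they reproduce the 84 seen in the log line above:

```java
public class DefaultMapTasks {

    // Suspected formula: total (active) nodes times map slots per node.
    static int defaultMapTasks(int activeNodes, int mapSlotsPerNode) {
        return activeNodes * mapSlotsPerNode;
    }

    public static void main(String[] args) {
        // Hypothetical example: 21 active nodes x 4 map slots = 84,
        // which would match "Setting default number of map tasks ... to : 84".
        System.out.println(defaultMapTasks(21, 4));
    }
}
```

Again, EMR may cap or adjust this figure based on memory or instance type, so treat the product only as a starting guess.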
Additionally, the JobClient exposes a set of information about the cluster. Using the method JobClient#getClusterStatus() you can access information like:
- Size of the cluster.
- Name of the trackers.
- Number of blacklisted/active trackers.
- Task capacity of the cluster.
- The number of currently running map & reduce tasks.
via the ClusterStatus class object, so you can try and compute the desired number in your program manually.
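A sketch of querying that information with the old mapred API might look as follows. This needs a live cluster (or at least a correctly configured JobConf) to run, so take it as an outline rather than a standalone program:

```java
import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ClusterInfo {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();              // picks up cluster config
        JobClient client = new JobClient(conf);
        ClusterStatus status = client.getClusterStatus();

        int trackers = status.getTaskTrackers();   // active tracker count
        int maxMaps  = status.getMaxMapTasks();    // map task capacity of the cluster
        int running  = status.getMapTasks();       // currently running map tasks

        System.out.println("Trackers: " + trackers
                + ", map capacity: " + maxMaps
                + ", running maps: " + running);
    }
}
```

If the default really is N * M, then getMaxMapTasks() should already give you that product directly, since it reports the cluster-wide map slot capacity.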