Hadoop: number of available map slots based on cluster size


Problem description



Reading the syslog generated by Hadoop, I can see lines similar to this one:

2013-05-06 16:32:45,118 INFO org.apache.hadoop.mapred.JobClient (main): Setting default number of map tasks based on cluster size to : 84

Does anyone know how this value is computed? And how can I get this value in my program?

Solution

I grepped the source code of Hadoop and did not find the string "Setting default number of map tasks based on cluster size to" at all (whereas I do find other strings that are printed when running MR jobs). Furthermore, this string is not printed anywhere in my local installation. A Google search for it turned up problems on AWS with EMR. As you confirmed, you're in fact using Amazon Elastic MapReduce. I believe EMR ships its own modifications to Hadoop's JobClient class, and that is what outputs this particular line.

As far as computing this number is concerned, I would suspect it is derived from characteristics like the total number of (active) nodes in the cluster (N) and the number of map slots per node (M), i.e. N*M. However, additional AWS-specific resource (memory) constraints may also be taken into account. You'd have to ask in EMR-related forums for the exact formula.
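Purely as an illustration of that guess (the numbers here are hypothetical, not taken from EMR documentation): a cluster with N = 7 active nodes and M = 12 map slots per node would give N*M = 7 * 12 = 84, which matches the value in the log line above; the actual per-node slot count depends on the EMR instance type and its configured defaults.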

Additionally, JobClient exposes a set of information about the cluster. Using the method JobClient#getClusterStatus() you can access information such as:

  • Size of the cluster.
  • Name of the trackers.
  • Number of blacklisted/active trackers.
  • Task capacity of the cluster.
  • The number of currently running map & reduce tasks.

via the returned ClusterStatus object, so you can try to compute the desired number in your program manually.
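As a minimal sketch of what that could look like with the old org.apache.hadoop.mapred API (the class name ClusterInfo and the JobConf setup are illustrative; adapt them to your own job and cluster configuration):

import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ClusterInfo {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster settings (mapred-site.xml etc.) from the classpath.
        JobConf conf = new JobConf(ClusterInfo.class);
        JobClient client = new JobClient(conf);

        ClusterStatus status = client.getClusterStatus();

        int activeTrackers     = status.getTaskTrackers();    // active task trackers (N)
        int mapSlotCapacity    = status.getMaxMapTasks();     // total map slot capacity of the cluster
        int reduceSlotCapacity = status.getMaxReduceTasks();  // total reduce slot capacity
        int runningMaps        = status.getMapTasks();        // currently running map tasks
        int runningReduces     = status.getReduceTasks();     // currently running reduce tasks

        System.out.println("Active trackers:      " + activeTrackers);
        System.out.println("Map slot capacity:    " + mapSlotCapacity);
        System.out.println("Reduce slot capacity: " + reduceSlotCapacity);
        System.out.println("Running map tasks:    " + runningMaps);
        System.out.println("Running reduce tasks: " + runningReduces);

        // If EMR's figure really is something like N * M, the per-node slot
        // count M can be estimated as mapSlotCapacity / activeTrackers.
        if (activeTrackers > 0) {
            System.out.println("Map slots per node (approx.): "
                    + (mapSlotCapacity / activeTrackers));
        }
    }
}

Whether the capacity reported here equals the number EMR logs is exactly what you would have to verify against your own cluster.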
