YARN job appears to have access to less resources than Ambari YARN manager reports

Problem description

Getting confused when trying to run a YARN process and getting errors. Looking in the Ambari UI YARN section, I'm seeing... (note it says 60GB available). Yet, when trying to run a YARN process, I get errors indicating that there are fewer resources available than what is being reported in Ambari, see...

➜  h2o-3.26.0.2-hdp3.1 hadoop jar h2odriver.jar -nodes 4 -mapperXmx 5g -output /home/ml1/hdfsOutputDir
Determining driver host interface for mapper->driver callback...
    [Possible callback IP address: 192.168.122.1]
    [Possible callback IP address: 172.18.4.49]
    [Possible callback IP address: 127.0.0.1]
Using mapper->driver callback IP address and port: 172.18.4.49:46721
(You can override these with -driverif and -driverport/-driverportrange and/or specify external IP using -extdriverif.)
Memory Settings:
    mapreduce.map.java.opts:     -Xms5g -Xmx5g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dlog4j.defaultInitOverride=true
    Extra memory percent:        10
    mapreduce.map.memory.mb:     5632
Hive driver not present, not generating token.
19/08/07 12:37:19 INFO client.RMProxy: Connecting to ResourceManager at hw01.ucera.local/172.18.4.46:8050
19/08/07 12:37:19 INFO client.AHSProxy: Connecting to Application History server at hw02.ucera.local/172.18.4.47:10200
19/08/07 12:37:19 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/ml1/.staging/job_1565057088651_0007
19/08/07 12:37:21 INFO mapreduce.JobSubmitter: number of splits:4
19/08/07 12:37:21 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1565057088651_0007
19/08/07 12:37:21 INFO mapreduce.JobSubmitter: Executing with tokens: []
19/08/07 12:37:21 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/3.1.0.0-78/0/resource-types.xml
19/08/07 12:37:21 INFO impl.YarnClientImpl: Submitted application application_1565057088651_0007
19/08/07 12:37:21 INFO mapreduce.Job: The url to track the job: http://HW01.ucera.local:8088/proxy/application_1565057088651_0007/
Job name 'H2O_80092' submitted
JobTracker job ID is 'job_1565057088651_0007'
For YARN users, logs command is 'yarn logs -applicationId application_1565057088651_0007'
Waiting for H2O cluster to come up...
19/08/07 12:37:38 INFO client.RMProxy: Connecting to ResourceManager at hw01.ucera.local/172.18.4.46:8050
19/08/07 12:37:38 INFO client.AHSProxy: Connecting to Application History server at hw02.ucera.local/172.18.4.47:10200

----- YARN cluster metrics -----
Number of YARN worker nodes: 4

----- Nodes -----
Node: http://HW03.ucera.local:8042 Rack: /default-rack, RUNNING, 1 containers used, 5.0 / 15.0 GB used, 1 / 3 vcores used
Node: http://HW04.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
Node: http://hw05.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
Node: http://HW02.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used

----- Queues -----
Queue name:            default
    Queue state:       RUNNING
    Current capacity:  0.08
    Capacity:          1.00
    Maximum capacity:  1.00
    Application count: 1
    ----- Applications in this queue -----
    Application ID:                  application_1565057088651_0007 (H2O_80092)
        Started:                     ml1 (Wed Aug 07 12:37:21 HST 2019)
        Application state:           FINISHED
        Tracking URL:                http://HW01.ucera.local:8088/proxy/application_1565057088651_0007/
        Queue name:                  default
        Used/Reserved containers:    1 / 0
        Needed/Used/Reserved memory: 5.0 GB / 5.0 GB / 0.0 GB
        Needed/Used/Reserved vcores: 1 / 1 / 0

Queue 'default' approximate utilization: 5.0 / 60.0 GB used, 1 / 12 vcores used

----------------------------------------------------------------------

ERROR: Unable to start any H2O nodes; please contact your YARN administrator.

       A common cause for this is the requested container size (5.5 GB)
       exceeds the following YARN settings:

           yarn.nodemanager.resource.memory-mb
           yarn.scheduler.maximum-allocation-mb

----------------------------------------------------------------------

For YARN users, logs command is 'yarn logs -applicationId application_1565057088651_0007'

Note

ERROR: Unable to start any H2O nodes; please contact your YARN administrator.

A common cause for this is the requested container size (5.5 GB) exceeds the following YARN settings:

  yarn.nodemanager.resource.memory-mb
  yarn.scheduler.maximum-allocation-mb

Yet, I have YARN configured with

yarn.scheduler.maximum-allocation-vcores=3
yarn.nodemanager.resource.cpu-vcores=3
yarn.nodemanager.resource.memory-mb=15GB
yarn.scheduler.maximum-allocation-mb=15GB

and we can see both container and node resource restrictions are higher than the requested container size.
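For reference, here is a rough sketch of how the live values can be double-checked outside of Ambari (paths assume the standard HDP /etc/hadoop/conf layout, and the ResourceManager host/port are taken from the driver output above):

# On any cluster node: confirm what is actually in the active yarn-site.xml
grep -A1 -E "yarn.nodemanager.resource.memory-mb|yarn.scheduler.maximum-allocation-mb" \
    /etc/hadoop/conf/yarn-site.xml

# From anywhere: ask the ResourceManager's REST API for the scheduler's view of the
# cluster (the JSON response includes fields such as totalMB, availableMB, allocatedMB)
curl -s http://hw01.ucera.local:8088/ws/v1/cluster/metrics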

Trying to do a heftier calculation with the default mapreduce pi example

[myuser@HW03 ~]$ yarn jar /usr/hdp/3.1.0.0-78/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 1000 1000
Number of Maps  = 1000
Samples per Map = 1000
....

and checking the RM UI, I can see that it is at least possible in some cases to use all of the RM's 60GB of resources (notice the 61440 MB at the bottom of the image)

So there are some things about the problem that I don't understand

Queue 'default' approximate utilization: 5.0 / 60.0 GB used, 1 / 12 vcores used

I would like to use the full 60GB that YARN can ostensibly provide (or at least have the option to, rather than have errors thrown). I would think that there should be enough resources to have each of the 4 nodes provide 15GB (> the requested 4 x 5GB = 20GB) to the process. Am I missing something here? Note that I only have the default root queue set up for YARN.
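For reference, a quick back-of-the-envelope check using the numbers reported above (the driver requests mapreduce.map.memory.mb = 5632 per mapper):

4 mappers x 5632 MB = 22528 MB (~22 GB)  vs. the 60 GB queue/cluster total
1 mapper  x 5632 MB =  5632 MB (~5.5 GB) vs. the 15 GB per-node limit

so on paper neither limit should be exceeded.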

----- Nodes -----

Node: http://HW03.ucera.local:8042 Rack: /default-rack, RUNNING, 1 containers used, 5.0 / 15.0 GB used, 1 / 3 vcores used

Node: http://HW04.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used

....

Why is only a single node being used before erroring out?

From these two things, it seems that neither the 15GB node limit nor the 60GB cluster limit is being exceeded, so why are these errors being thrown? What about this situation am I misinterpreting? What can be done to fix it (again, I would like to be able to use all of the apparent 60GB of YARN resources for the job without error)? Any debugging suggestions or fixes?
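As a first debugging step (a sketch; the application id is the one from the failed run above), the ResourceManager's own diagnostics for the application can show why the request was not satisfied:

# Application report, including the Diagnostics field recorded by the ResourceManager
yarn application -status application_1565057088651_0007

# Full container logs, as already suggested in the driver output
yarn logs -applicationId application_1565057088651_0007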

UPDATE:

The problem appears to be related to How to properly change uid for HDP / ambari-created user? and the fact that having a user exist on a node and having a hdfs://user/<username> directory with correct permissions (as I was led to believe from a Hortonworks forum post) is not sufficient for that user to be acknowledged as "existing" on the cluster.

Running the hadoop jar command as a different user (in this case, the Ambari-created hdfs user) that exists on all cluster nodes (even though Ambari created this user with different uids across nodes (IDK if this is a problem)) and has a hdfs://user/hdfs dir, I found that the h2o jar ran as expected.
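For completeness, the working submission looked roughly like this (a sketch; only the submitting user changes relative to the failing run above, and the -output path is just an example location writable by the hdfs user):

# Same h2odriver submission, but run as the hdfs user, which exists on every node
sudo -u hdfs hadoop jar h2odriver.jar -nodes 4 -mapperXmx 5g -output /user/hdfs/hdfsOutputDir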

I was initially under the impression that users only needed to exist on whatever client machine was being used, plus the need for a hdfs://user/ dir (see https://community.cloudera.com/t5/Support-Questions/Adding-a-new-user-to-the-cluster/m-p/130319/highlight/true#M93005). One concerning / confusing thing that has come from this is the fact that Ambari apparently created the hdfs user on the various cluster nodes with differing uid and gid values, e.g....

[root@HW01 ~]# clush -ab id hdfs
---------------
HW[01-04] (4)
---------------
uid=1017(hdfs) gid=1005(hadoop) groups=1005(hadoop),1003(hdfs)
---------------
HW05
---------------
uid=1021(hdfs) gid=1006(hadoop) groups=1006(hadoop),1004(hdfs)
[root@HW01 ~]# 
[root@HW01 ~]#
# wondering what else is using a uid 1021 across the nodes 
[root@HW01 ~]# clush -ab id 1021
---------------
HW[01-04] (4)
---------------
uid=1021(hbase) gid=1005(hadoop) groups=1005(hadoop)
---------------
HW05
---------------
uid=1021(hdfs) gid=1006(hadoop) groups=1006(hadoop),1004(hdfs)

This does not seem like how it is supposed to be (just my suspicion from having worked with MapR (which requires the uids and gids to be the same across nodes) and from looking here: https://www.ibm.com/support/knowledgecenter/en/STXKQY_BDA_SHR/bl1adv_userandgrpid.htm). Note that HW05 was a node that was added later. If this is actually fine in HDP, I plan to just add the user I actually intend to use h2o with across all the nodes, with whatever arbitrary uid and gid values. Any thoughts on this? Any docs you could link me to that support why this is right or wrong?

Will look into this a bit more before posting as an answer. I think I basically need a bit more clarification on when HDP considers a user to "exist" on a cluster.

Recommended answer

The problem appears to be related to How to properly change uid for HDP / ambari-created user? and the fact that having a user exist on a node and having a hdfs://user/ directory with correct permissions (as I was led to believe from a Hortonworks forum post) is not sufficient for that user to be acknowledged as "existing" on the cluster. This jibes with discussions I've had with Hortonworks experts, who said that the YARN-using user must exist on all of the cluster's datanodes.

Running the hadoop jar command as a different user (in this case, the Ambari-created hdfs user) that exists on all cluster nodes (even though Ambari created this user with different uids across nodes (IDK if this is a problem)) and has a hdfs://user/hdfs dir, I found that the h2o jar ran as expected.

I was initially under the impression that users only needed to exist on whatever client machine was being used, plus the need for a hdfs://user/ dir (see https://community.cloudera.com/t5/Support-Questions/Adding-a-new-user-to-the-cluster/m-p/130319/highlight/true#M93005).

Side note:

One concerning / confusing thing that has come from this is the fact that Ambari apparently created the hdfs user on the various cluster nodes with differing uid and gid values, e.g....

[root@HW01 ~]# clush -ab id hdfs
---------------
HW[01-04] (4)
---------------
uid=1017(hdfs) gid=1005(hadoop) groups=1005(hadoop),1003(hdfs)
---------------
HW05
---------------
uid=1021(hdfs) gid=1006(hadoop) groups=1006(hadoop),1004(hdfs)
[root@HW01 ~]# 
[root@HW01 ~]#
# wondering what else is using a uid 1021 across the nodes 
[root@HW01 ~]# clush -ab id 1021
---------------
HW[01-04] (4)
---------------
uid=1021(hbase) gid=1005(hadoop) groups=1005(hadoop)
---------------
HW05
---------------
uid=1021(hdfs) gid=1006(hadoop) groups=1006(hadoop),1004(hdfs)

This does not seem like how it is supposed to be (just my suspicion from having worked with MapR (which requires the uids and gids to be the same across nodes) and from looking here: https://www.ibm.com/support/knowledgecenter/en/STXKQY_BDA_SHR/bl1adv_userandgrpid.htm). Note that HW05 was a node that was added later. If this is actually fine in HDP, I plan to just add the user I actually intend to use h2o with across all the nodes, with whatever arbitrary uid and gid values. Any thoughts on this? Any docs you could link me to that support why this is right or wrong?
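If adding the user everywhere does turn out to be the right fix, this is roughly what I have in mind (a sketch; the 1050 uid/gid is an arbitrary example, and it assumes clush is already configured and the cluster is not kerberized):

# Create the same user with the same uid/gid on every node
clush -ab "groupadd -g 1050 ml1 && useradd -u 1050 -g 1050 -G hadoop ml1"

# Give that user an HDFS home directory with the right ownership
sudo -u hdfs hdfs dfs -mkdir -p /user/ml1
sudo -u hdfs hdfs dfs -chown ml1:hadoop /user/ml1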

Looking into this a bit more here: HDFS NFS locations using weird numerical username values for directory permissions
