Why does Yarn on EMR not allocate all nodes to running Spark jobs?


Problem Description


I'm running a job on Apache Spark on Amazon Elastic Map Reduce (EMR). Currently I'm running on emr-4.1.0 which includes Amazon Hadoop 2.6.0 and Spark 1.5.0.

When I start the job, YARN correctly allocates all the worker nodes to the Spark job (with one for the driver, of course).

I have the magic "maximizeResourceAllocation" property set to "true", and the spark property "spark.dynamicAllocation.enabled" also set to "true".

However, if I resize the EMR cluster by adding nodes to the CORE pool of worker machines, YARN only adds some of the new nodes to the Spark job.

For example, this morning I had a job that was using 26 nodes (m3.2xlarge, if that matters) - 1 for the driver, 25 executors. I wanted to speed up the job, so I tried adding 8 more nodes. YARN picked up all of the new nodes, but allocated only 1 of them to the Spark job. Spark did successfully pick up the new node and is using it as an executor, but my question is: why is YARN letting the other 7 nodes just sit idle?
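
For reference, here is a minimal sketch of how to query YARN's own view of allocated vs. total resources through the ResourceManager's cluster-metrics REST endpoint (the hostname is a placeholder for the EMR master node; 8088 is the default ResourceManager web port):

# Sketch: ask the YARN ResourceManager how much of the cluster is actually allocated.
# The hostname is a placeholder for the EMR master node.
import json
from urllib.request import urlopen

RM_METRICS = "http://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:8088/ws/v1/cluster/metrics"

metrics = json.loads(urlopen(RM_METRICS).read().decode("utf-8"))["clusterMetrics"]
print("active nodes:     %d" % metrics["activeNodes"])
print("allocated MB:     %d of %d" % (metrics["allocatedMB"], metrics["totalMB"]))
print("allocated vcores: %d of %d" % (metrics["allocatedVirtualCores"], metrics["totalVirtualCores"]))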

It's annoying for obvious reasons - I have to pay for the resources even though they're not being used, and my job hasn't sped up at all!

Does anybody know how YARN decides when to add nodes to running Spark jobs? What variables come into play? Memory? VCores? Anything else?

Thanks in advance!

Solution

Okay, with the help of @sean_r_owen, I was able to track this down.

The problem was this: when setting spark.dynamicAllocation.enabled to true, spark.executor.instances shouldn't be set - an explicit value for that will override dynamic allocation and turn it off. It turns out that EMR sets it in the background if you do not set it yourself. To get the desired behaviour, you need to explicitly set spark.executor.instances to 0.

For the record, here are the contents of one of the files we pass to the --configurations flag when creating an EMR cluster:

[
    {
        "Classification": "capacity-scheduler",
        "Properties": {
            "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
        }
    },

    {
        "Classification": "spark",
        "Properties": {
            "maximizeResourceAllocation": "true"
        }
    },

    {
        "Classification": "spark-defaults",
        "Properties": {
            "spark.dynamicAllocation.enabled": "true",
            "spark.executor.instances": "0"
        }
    } 
]
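
For completeness, the same configuration can also be passed when creating the cluster programmatically instead of through the CLI's --configurations flag. Below is a minimal sketch using boto3's run_job_flow; the cluster name, instance types and count, and IAM roles are placeholders, not details from the original setup:

# Sketch: create an EMR cluster with the configuration shown above via boto3.
# Cluster name, instance count/types, and IAM roles are placeholders.
import boto3

CONFIGURATIONS = [
    {
        "Classification": "capacity-scheduler",
        "Properties": {
            "yarn.scheduler.capacity.resource-calculator":
                "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
        }
    },
    {
        "Classification": "spark",
        "Properties": {"maximizeResourceAllocation": "true"}
    },
    {
        "Classification": "spark-defaults",
        "Properties": {
            "spark.dynamicAllocation.enabled": "true",
            "spark.executor.instances": "0"
        }
    }
]

emr = boto3.client("emr")
emr.run_job_flow(
    Name="spark-dynamic-allocation",        # placeholder name
    ReleaseLabel="emr-4.1.0",
    Applications=[{"Name": "Spark"}],
    Configurations=CONFIGURATIONS,
    Instances={
        "MasterInstanceType": "m3.2xlarge",
        "SlaveInstanceType": "m3.2xlarge",
        "InstanceCount": 26,                # placeholder count; adjust to the desired cluster size
        "KeepJobFlowAliveWhenNoSteps": True
    },
    JobFlowRole="EMR_EC2_DefaultRole",      # default EMR roles; adjust for your account
    ServiceRole="EMR_DefaultRole"
)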

This gives us an EMR cluster where Spark uses all the nodes, including added nodes, when running jobs. It also appears to use all/most of the memory and all (?) the cores.

(I'm not entirely sure that it's using all of the actual cores, but it is definitely using more than 1 VCore, which it wasn't before. Following Glennie Helles's advice, it is now behaving better and using half of the listed VCores, which seems to equal the actual number of cores...)
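
To confirm which settings actually took effect, the executor- and dynamic-allocation-related properties can be dumped from inside a running job. Here is a minimal PySpark sketch (the app name is a placeholder):

# Sketch: print the effective executor / dynamic allocation settings
# as seen by a running Spark application.
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("conf-check"))  # placeholder app name

for key, value in sorted(sc.getConf().getAll()):
    if key.startswith(("spark.executor.", "spark.dynamicAllocation.")):
        print("%s = %s" % (key, value))

With the configuration above, spark.dynamicAllocation.enabled should come back as "true" and spark.executor.instances as "0", rather than a fixed positive count injected by EMR.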
