Spark: what are the advantages of having multiple executors per node for a job?

Question

I am running my job on an AWS EMR cluster. It is a 40-node cluster of cr1.8xlarge instances; each cr1.8xlarge has 240GB of memory and 32 cores. I can run with either of the following configurations:

--driver-memory 180g --driver-cores 26 --executor-memory 180g --executor-cores 26 --num-executors 40 --conf spark.default.parallelism=4000

--driver-memory 180g --driver-cores 26 --executor-memory 90g --executor-cores 13 --num-executors 80 --conf spark.default.parallelism=4000
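
For reference, these spark-submit flags map to ordinary Spark properties. A minimal PySpark sketch (my addition, not part of the original question; "my-emr-job" is a hypothetical app name) expressing the second configuration as a SparkConf:

    # Sketch: the second spark-submit configuration expressed as Spark properties.
    from pyspark import SparkConf, SparkContext

    conf = (
        SparkConf()
        .setAppName("my-emr-job")
        .set("spark.executor.memory", "90g")        # --executor-memory 90g
        .set("spark.executor.cores", "13")          # --executor-cores 13
        .set("spark.executor.instances", "80")      # --num-executors 80 (YARN)
        .set("spark.default.parallelism", "4000")   # --conf spark.default.parallelism=4000
    )
    # Driver memory/cores must still be given on the command line
    # (--driver-memory, --driver-cores): the driver JVM is already running
    # by the time this code executes.
    sc = SparkContext(conf=conf)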

From the job-tracker web UI, the number of tasks running simultaneously is essentially just the number of cores (CPUs) available. So I am wondering: are there any advantages, or specific scenarios, where we would want more than one executor per node?
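
A quick back-of-the-envelope check (my sketch, not part of the original question) confirms that both configurations expose the same total number of concurrent task slots:

    # Tasks run one per executor core, so slots = num_executors * executor_cores.
    configs = {
        "40 fat executors":  {"num_executors": 40, "executor_cores": 26},
        "80 thin executors": {"num_executors": 80, "executor_cores": 13},
    }
    for name, c in configs.items():
        slots = c["num_executors"] * c["executor_cores"]
        print(f"{name}: {slots} concurrent task slots")  # 1040 in both cases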

Thanks!

Answer

Yes, there are advantages to running multiple executors per node, especially on large instances like yours. I recommend reading this blog post from Cloudera.

A snippet of the post that would be of particular interest to you:

To hopefully make all of this a little more concrete, here’s a worked example of configuring a Spark app to use as much of the cluster as possible: Imagine a cluster with six nodes running NodeManagers, each equipped with 16 cores and 64GB of memory. The NodeManager capacities, yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, should probably be set to 63 * 1024 = 64512 (megabytes) and 15 respectively. We avoid allocating 100% of the resources to YARN containers because the node needs some resources to run the OS and Hadoop daemons. In this case, we leave a gigabyte and a core for these system processes. Cloudera Manager helps by accounting for these and configuring these YARN properties automatically.
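
Spelling out the capacity arithmetic from the quote (a sketch of mine; the variable names mirror the real YARN properties mentioned above):

    # Reserve 1GB and 1 core per node for the OS and Hadoop daemons.
    node_mem_gb, node_cores = 64, 16
    yarn_nodemanager_resource_memory_mb = (node_mem_gb - 1) * 1024  # 63 * 1024 = 64512
    yarn_nodemanager_resource_cpu_vcores = node_cores - 1           # 15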

The likely first impulse would be to use --num-executors 6 --executor-cores 15 --executor-memory 63G. However, this is the wrong approach because:

- 63GB + the executor memory overhead won't fit within the 63GB capacity of the NodeManagers.
- The application master will take up a core on one of the nodes, meaning that there won't be room for a 15-core executor on that node.
- 15 cores per executor can lead to bad HDFS I/O throughput.

A better option would be to use --num-executors 17 --executor-cores 5 --executor-memory 19G. Why?

This config results in three executors on all nodes except for the one with the AM, which will have two executors. --executor-memory was derived as (63GB per node / 3 executors per node) = 21GB. The off-heap overhead is 21 * 0.07 = 1.47GB (0.07 being the default memory overhead fraction for YARN executors in Spark 1.x), and 21 - 1.47 ≈ 19GB.
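
Putting the whole heuristic together, here is a small calculator (my sketch, not code from the Cloudera post) that reproduces the recommended numbers; the 0.07 overhead fraction matches the Spark 1.x default for spark.yarn.executor.memoryOverhead (7% of executor memory, minimum 384MB):

    import math

    def size_executors(nodes, cores_per_node, mem_gb_per_node,
                       reserved_cores=1, reserved_mem_gb=1,
                       cores_per_executor=5, overhead_fraction=0.07):
        usable_cores = cores_per_node - reserved_cores        # leave a core for OS/daemons
        usable_mem_gb = mem_gb_per_node - reserved_mem_gb     # leave a GB as well
        execs_per_node = usable_cores // cores_per_executor   # 15 // 5 = 3
        num_executors = nodes * execs_per_node - 1            # one slot lost to the YARN AM
        mem_per_exec = usable_mem_gb / execs_per_node         # 63 / 3 = 21
        heap_gb = math.floor(mem_per_exec * (1 - overhead_fraction))  # 21 - 1.47 -> 19
        return num_executors, cores_per_executor, heap_gb

    print(size_executors(nodes=6, cores_per_node=16, mem_gb_per_node=64))
    # -> (17, 5, 19), i.e. --num-executors 17 --executor-cores 5 --executor-memory 19G

Applied to your cr1.8xlarge nodes (32 cores, 240GB each), the same heuristic would suggest roughly six 5-core executors per node, i.e. many modest executors rather than one giant 26-core one.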
