Apache Spark: The number of cores vs. the number of executors


Problem description

I'm trying to understand the relationship between the number of cores and the number of executors when running a Spark job on YARN.

The test environment is as follows:

• Number of data nodes: 3
• Data node machine spec:
  • CPU: Core i7-4790 (# of cores: 4, # of threads: 8)
  • RAM: 32GB (8GB x 4)
  • HDD: 8TB (2TB x 4)
• Network: 1Gb
• Spark version: 1.0.0
• Hadoop version: 2.4.0 (Hortonworks HDP 2.1)

Spark job flow: sc.textFile -> filter -> map -> filter -> mapToPair -> reduceByKey -> map -> saveAsTextFile
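The post does not include the job's source code; purely as an illustration, that pipeline might look roughly like the following with the Spark Java API (the input path, parsing logic, and filter predicates below are hypothetical placeholders, and Java 8 lambdas are used for brevity):

```java
// Illustrative sketch only: the real predicates, key extraction and paths
// are not shown in the post, so placeholders are used here.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class CoresVsExecutorsJob {
    public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("cores-vs-executors-test"));

        sc.textFile("hdfs:///data/input.txt")                    // single 165GB text file (placeholder path)
          .filter(line -> !line.isEmpty())                       // first filter (placeholder predicate)
          .map(line -> line.trim())                              // map (placeholder transform)
          .filter(line -> line.contains("\t"))                   // second filter (placeholder predicate)
          .mapToPair(line ->                                     // mapToPair: key on the first field
              new Tuple2<String, Long>(line.split("\t")[0], 1L))
          .reduceByKey((a, b) -> a + b)                          // reduceByKey triggers the shuffle
          .map(pair -> pair._1() + "\t" + pair._2())             // map back to output lines
          .saveAsTextFile("hdfs:///data/output");                // 41GB result (placeholder path)

        sc.stop();
    }
}
```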

Input data

• Type: single text file
• Size: 165GB
• Number of lines: 454,568,833

Output

• Number of lines after the second filter: 310,640,717
• Number of lines in the result file: 99,848,268
• Size of the result file: 41GB

The job was run with the following configurations:

1. --master yarn-client --executor-memory 19G --executor-cores 7 --num-executors 3 (one executor per data node, using as many cores as possible)
2. --master yarn-client --executor-memory 19G --executor-cores 4 --num-executors 3 (reduced number of cores)
3. --master yarn-client --executor-memory 4G --executor-cores 2 --num-executors 12 (fewer cores, more executors)

Elapsed times:

1. 50 min 15 sec
2. 55 min 48 sec
3. 31 min 23 sec

To my surprise, (3) was much faster. I thought that (1) would be faster, since there would be less inter-executor communication when shuffling. Although (1) has fewer total cores than (3), the number of cores is not the key factor, since (2) did perform well.

(The following was added after pwilmot's answer.)

For reference, the performance monitor screenshots are as follows:

• Ganglia data node summary for (1) - the job started at 04:37.

• Ganglia data node summary for (3) - the job started at 19:47. Please ignore the graphs before that time.

The graph roughly divides into two sections:

• First: from start to reduceByKey: CPU intensive, no network activity
• Second: after reduceByKey: CPU drops, network I/O is done.

As the graph shows, (1) can use as much CPU power as it was given. So, it might not be a problem with the number of threads.

How can this result be explained?

Recommended answer

To hopefully make all of this a little more concrete, here's a worked example of configuring a Spark app to use as much of the cluster as possible: Imagine a cluster with six nodes running NodeManagers, each equipped with 16 cores and 64GB of memory. The NodeManager capacities, yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, should probably be set to 63 * 1024 = 64512 (megabytes) and 15 respectively. We avoid allocating 100% of the resources to YARN containers because the node needs some resources to run the OS and Hadoop daemons. In this case, we leave a gigabyte and a core for these system processes. Cloudera Manager helps by accounting for these and configuring these YARN properties automatically.
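For reference, those two capacities are normally set in each node's yarn-site.xml; a sketch with the values from this worked example (not taken from the original post) would look like:

```xml
<!-- Sketch: NodeManager capacities from the worked example above
     (63 GB of memory and 15 vcores per node left for YARN containers). -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>64512</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>15</value>
</property>
```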

The likely first impulse would be to use --num-executors 6 --executor-cores 15 --executor-memory 63G. However, this is the wrong approach because:

• 63GB + the executor memory overhead won't fit within the 63GB capacity of the NodeManagers.
• The application master will take up a core on one of the nodes, meaning that there won't be room for a 15-core executor on that node.
• 15 cores per executor can lead to bad HDFS I/O throughput.

A better option would be to use --num-executors 17 --executor-cores 5 --executor-memory 19G. Why?

This config results in three executors on all nodes except for the one with the AM, which will have two executors. --executor-memory was derived as 63 GB / 3 executors per node = 21 GB; 21 * 0.07 = 1.47 GB of overhead; 21 - 1.47 ≈ 19 GB.
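A minimal sketch of that arithmetic, assuming the 7% memory overhead factor used in the calculation above (with Spark's 384 MB floor on the overhead):

```java
// Sketch of the executor-memory sizing above: 63 GB usable per node,
// 3 executor slots per node, 7% memory overhead.
public class ExecutorMemorySizing {
    public static void main(String[] args) {
        double usablePerNodeGb = 63.0;                              // memory per node left for YARN containers
        int executorsPerNode = 3;
        double perExecutorGb = usablePerNodeGb / executorsPerNode;  // 21 GB per executor slot
        double overheadGb = Math.max(0.384, 0.07 * perExecutorGb);  // ~1.47 GB off-heap overhead
        long executorMemoryGb = (long) Math.floor(perExecutorGb - overheadGb);
        System.out.println("--executor-memory " + executorMemoryGb + "G"); // prints 19G
    }
}
```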

The explanation was given in an article in Cloudera's blog, How-to: Tune Your Apache Spark Jobs (Part 2).
