What are workers, executors, cores in Spark Standalone cluster?

Problem Description

I read Cluster Mode Overview and I still can't understand the different processes in the Spark Standalone cluster and the parallelism.

Is the worker a JVM process or not? I ran the bin\start-slave.sh and found that it spawned the worker, which is actually a JVM.

As per the above link, an executor is a process launched for an application on a worker node that runs tasks. An executor is also a JVM.

These are my questions:

  1. Executors are per application. Then what is the role of a worker? Does it coordinate with the executor and communicate the result back to the driver? Or does the driver talk to the executor directly? If so, what is the worker's purpose then?

  2. How to control the number of executors for an application?

  3. Can the tasks be made to run in parallel inside the executor? If so, how to configure the number of threads for an executor?

  4. What is the relation between a worker, executors and executor cores (--total-executor-cores)?

  5. What does it mean to have more workers per node?

Updated

Let's take some examples to understand this better.

Example 1: A standalone cluster with 5 worker nodes (each node having 8 cores). I start an application with the default settings.

Example 2: Same cluster config as Example 1, but I run an application with the following settings: --executor-cores 10 --total-executor-cores 10.

Example 3: Same cluster config as Example 1, but I run an application with the following settings: --executor-cores 10 --total-executor-cores 50.

Example 4: Same cluster config as Example 1, but I run an application with the following settings: --executor-cores 50 --total-executor-cores 50.

Example 5: Same cluster config as Example 1, but I run an application with the following settings: --executor-cores 50 --total-executor-cores 10.

In each of these examples, how many executors? How many threads per executor? How many cores? How is the number of executors decided per application? Is it always the same as the number of workers?

Solution

Spark uses a master/slave architecture: one central coordinator (the driver) communicates with many distributed workers (executors). The driver and each of the executors run in their own Java processes.

DRIVER

The driver is the process where the main method runs. It first converts the user program into tasks and then schedules those tasks on the executors.

EXECUTORS

Executors are processes on the worker nodes, in charge of running the individual tasks of a given Spark job. They are launched at the beginning of a Spark application and typically run for the entire lifetime of the application. Once they have run their tasks they send the results to the driver. They also provide in-memory storage for the RDDs that user programs cache, through the Block Manager.
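
To make the driver/executor split concrete, here is a minimal sketch of a standalone application in Scala. The master URL, object name and app name are made-up placeholders rather than anything from the original post; the point is only that the JVM running main() is the driver, that cache() keeps partitions in executor memory through the Block Manager, and that the action at the end is what actually turns the program into tasks on the executors.

    import org.apache.spark.{SparkConf, SparkContext}

    // The process that runs main() is the driver.
    object WordLengths {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("word-lengths")             // hypothetical app name
          .setMaster("spark://master-host:7077")  // hypothetical standalone master URL

        val sc = new SparkContext(conf)           // from here on, this process is the driver

        val words = sc.parallelize(Seq("spark", "worker", "executor", "core"))

        // cache() asks the executors to keep these partitions in memory via their Block Managers.
        val lengths = words.map(_.length).cache()

        // The action below is what the driver turns into tasks and schedules on the executors;
        // the executors run the tasks and send the results back to the driver.
        println(lengths.sum())

        sc.stop()                                 // executors are terminated and resources released
      }
    }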

APPLICATION EXECUTION FLOW

With this in mind, when you submit an application to the cluster with spark-submit, this is what happens internally:

  1. A standalone application starts and instantiates a SparkContext instance (and only then can you call the application a driver).
  2. The driver program asks the cluster manager for resources to launch executors.
  3. The cluster manager launches executors.
  4. The driver process runs through the user application. Depending on the actions and transformations over RDDs, tasks are sent to the executors.
  5. Executors run the tasks and save the results.
  6. If any worker crashes, its tasks will be sent to different executors to be processed again. In the book "Learning Spark: Lightning-Fast Big Data Analysis" they talk about Spark and Fault Tolerance (see the configuration sketch after this list):

Spark automatically deals with failed or slow machines by re-executing failed or slow tasks. For example, if the node running a partition of a map() operation crashes, Spark will rerun it on another node; and even if the node does not crash but is simply much slower than other nodes, Spark can preemptively launch a "speculative" copy of the task on another node, and take its result if that finishes.

  7. With SparkContext.stop() from the driver, or if the main method exits/crashes, all the executors will be terminated and the cluster resources will be released by the cluster manager.
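
The speculative execution described in the quote above is off by default in Spark and is controlled by the spark.speculation family of properties (these are standard Spark configuration names; the snippet itself is just an illustration, not something from the book). A hedged sketch, assuming you build the SparkConf yourself; the same properties can also be passed to spark-submit with --conf:

    import org.apache.spark.SparkConf

    // Illustration only: enable speculative re-launching of slow tasks.
    val conf = new SparkConf()
      .setAppName("speculation-example")           // hypothetical app name
      .set("spark.speculation", "true")            // off by default
      .set("spark.speculation.multiplier", "1.5")  // how many times slower than the median a task must be
      .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish before speculating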

YOUR QUESTIONS

  1. When executors are started they register themselves with the driver, and from then on the two communicate directly. The workers are in charge of communicating the availability of their resources to the cluster manager.

  2. In a YARN cluster you can do that with --num-executors. In a standalone cluster you will get one executor per worker unless you play with spark.executor.cores and a worker has enough cores to hold more than one executor. (As @JacekLaskowski pointed out, --num-executors is no longer in use in YARN: https://github.com/apache/spark/commit/16b6d18613e150c7038c613992d80a7828413e66). See the configuration sketch after this list.

  3. You can assign the number of cores per executor with --executor-cores.

  4. --total-executor-cores is the maximum number of executor cores per application.

  5. As Sean Owen said in this thread: "there's not a good reason to run more than one worker per machine". You would have many JVMs sitting on one machine, for instance.
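
As a concrete sketch for points 2 to 4 on a standalone cluster: --executor-cores corresponds to the spark.executor.cores property and --total-executor-cores to spark.cores.max, so the same limits can be set on the spark-submit command line or programmatically. The master URL, app name and the particular numbers below are made-up placeholders.

    import org.apache.spark.SparkConf

    // Roughly equivalent to:
    //   spark-submit --executor-cores 2 --total-executor-cores 10 --executor-memory 2g ...
    val conf = new SparkConf()
      .setMaster("spark://master-host:7077")  // hypothetical standalone master URL
      .setAppName("resource-limits-example")  // hypothetical app name
      .set("spark.executor.cores", "2")       // cores per executor      (--executor-cores)
      .set("spark.cores.max", "10")           // total cores for the app (--total-executor-cores)
      .set("spark.executor.memory", "2g")     // memory per executor     (--executor-memory)

With these settings, any worker that still has at least 2 free cores and 2g of free memory can host an executor, and Spark stops acquiring executors once the 10-core application cap is reached.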

UPDATE

I haven't been able to test these scenarios, but according to the documentation:

EXAMPLE 1: Spark will greedily acquire as many cores and executors as are offered by the scheduler. So in the end you will get 5 executors with 8 cores each.

EXAMPLES 2 to 5: Spark won't be able to allocate as many cores as requested in a single worker, hence no executors will be launched.
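
I haven't run this either, but as a back-of-the-envelope illustration of how the numbers combine on the same 5-worker, 8-cores-per-node cluster, take a hypothetical valid setting of --executor-cores 4 --total-executor-cores 40 (and assume each worker also has enough memory for two executors):

    // Hypothetical calculation, not output from a real run.
    val coresPerWorker     = 8
    val workers            = 5
    val executorCores      = 4   // spark.executor.cores (--executor-cores)
    val totalCoresCap      = 40  // spark.cores.max      (--total-executor-cores)

    val executorsPerWorker = coresPerWorker / executorCores  // 8 / 4 = 2
    val totalExecutors     = math.min(
      totalCoresCap / executorCores,   // capped by the application-wide core limit
      workers * executorsPerWorker     // capped by what the workers can host
    )
    // totalExecutors == 10: ten executors with 4 cores each, 40 cores in total.

Because each worker has only 8 cores, any setting with --executor-cores greater than 8 (as in Examples 2 to 5) can never fit even a single executor on a worker, which is why no executors get launched there.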
