How to make Spark driver resilient to Master restarts?


Problem description

I have a Spark Standalone (not YARN/Mesos) cluster and a driver application running (in client mode) that talks to the cluster to execute its tasks. However, if I shut down and restart the Spark master and workers, the driver does not reconnect to the master and resume its work.

Perhaps I'm confused about the relationship between the Spark Master and the driver. In a situation like this, is the Master responsible for reconnecting back to the driver? If so, does the Master serialize its current state to disk somewhere that it can restore on restart?

Solution


"In a situation like this, is the Master responsible for reconnecting back to the driver? If so, does the Master serialize its current state to disk somewhere that it can restore on restart?"

The relationship between the Master node and the driver depends on a few factors. First, the driver is what hosts your SparkContext/StreamingContext and is in charge of the job's execution. It is what creates the DAG, and it holds the DAGScheduler and TaskScheduler which assign stages and tasks, respectively. The Master node may serve as the host for the driver if you use Spark Standalone and run your job in "client mode". In that case, the Master also hosts the driver process, and if it dies, the driver dies with it. If "cluster mode" is used, the driver resides on one of the Worker nodes and communicates with the Master frequently to get the status of the currently running job, send back metadata regarding the status of completed batches, etc.
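As a concrete illustration of the two deploy modes, here is a minimal sketch of the corresponding spark-submit invocations. The master host, application class, and jar name are placeholders, not details from the original question.

    # Client mode (the questioner's setup): the driver runs inside the
    # process that invoked spark-submit and lives and dies with it.
    spark-submit \
      --master spark://master-host:7077 \
      --deploy-mode client \
      --class com.example.MyApp \
      my-app.jar

    # Cluster mode: the driver is launched on one of the Worker nodes.
    # The optional --supervise flag asks the standalone cluster to restart
    # the driver if it exits with a non-zero status.
    spark-submit \
      --master spark://master-host:7077 \
      --deploy-mode cluster \
      --supervise \
      --class com.example.MyApp \
      my-app.jar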

Running on Standalone, if the Master dies and you restart it, the Master does not re-execute the jobs that were previously running. To achieve that, you can provide the cluster with an additional Master node and set it up so that ZooKeeper holds the Masters' state and switches between the two in case of failure. When you set the cluster up in such a way, the Master knows about its previously executed jobs and resumes them on your behalf once the new Master has taken the lead.
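A rough sketch of that recovery setup, assuming a three-node ZooKeeper quorum (the host names are placeholders; the property names come from Spark's standalone high-availability documentation):

    # In conf/spark-env.sh on each Master node:
    export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
      -Dspark.deploy.zookeeper.dir=/spark"

    # The driver lists both Masters so it can register with whichever one
    # is the current leader and fail over if that one dies:
    spark-submit --master spark://master1:7077,master2:7077 ...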

You can read how to create a standby Spark Master node in the documentation.

