Resources/Documentation on how the failover process works for the Spark Driver (and its YARN Container) in yarn-cluster mode
Question
I'm trying to understand whether the Spark Driver is a single point of failure when deploying in cluster mode on YARN. So I'd like to get a better grasp of the innards of the failover process for the YARN container of the Spark Driver in this context.
I know that the Spark Driver runs in the Spark Application Master inside a YARN container. The Spark Application Master requests resources from the YARN ResourceManager when required. But I haven't been able to find a document with enough detail about the failover process in the event that the YARN container of the Spark Application Master (and Spark Driver) fails.
I'm trying to find detailed resources that would let me answer some questions about the following scenario: the host machine of the YARN container that runs the Spark Application Master / Spark Driver loses network connectivity for one hour.
- Does the YARN ResourceManager spawn a new YARN container with another Spark Application Master / Spark Driver?
- In that case (spawning a new YARN container), does the Spark Driver start from scratch, even if at least one stage had completed on one of the Executors and the original Driver had been notified of this before it failed? Does the storage level used in persist() make a difference here? Will the new Spark Driver know that the Executor had completed that stage? Would Tachyon help out in this scenario?
- Does a failback process get triggered if network connectivity to the host machine of the original Spark Application Master's YARN container is restored? I guess this behaviour can be controlled from YARN, but I don't know what the default is when deploying Spark in cluster mode.
I'd really appreciate it if you could point me to some documents / web pages where the architecture of Spark in yarn-cluster mode and the failover process are explored in detail.
Answer
We just started running on YARN, so I don't know much. But I'm almost certain we had no automatic failover at the driver level. (We implemented some on our own.)
I would not expect there to be any default failover solution for the driver. You (the driver author) are the only one who knows how to health-check your application. And the state that lives in the driver is not something that can be automatically serialized. When a SparkContext is destroyed, all the RDDs created in it are lost, because they are meaningless without the running application.
The recovery strategy we have implemented is very simple. After every costly Spark operation we make a manual checkpoint. We save the RDD to disk (think saveAsTextFile) and load it back right away. This erases the lineage of the RDD, so if a partition is lost it will be reloaded rather than recalculated.
We also store what we have done and the file name. So if the driver restarts, it can pick up where it left off, at the granularity of such operations.
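The answer gives no code for that bookkeeping, so the following is a hypothetical sketch of one way to do it: keep a small manifest mapping each completed operation to its checkpoint file, and have a restarted driver skip any operation that already appears in the manifest. All names here (CheckpointManifest, the operation names) are made up for illustration.

```python
import json
import os
import tempfile

class CheckpointManifest:
    """Tracks which costly operations have finished and where their output lives."""
    def __init__(self, path):
        self.path = path
        self.done = {}  # operation name -> checkpoint file
        if os.path.exists(path):
            with open(path) as f:
                self.done = json.load(f)

    def completed(self, op):
        return op in self.done

    def record(self, op, data_file):
        self.done[op] = data_file
        with open(self.path, "w") as f:
            json.dump(self.done, f)

workdir = tempfile.mkdtemp()
manifest_path = os.path.join(workdir, "manifest.json")

# First driver run: nothing is recorded yet, so every operation executes.
manifest = CheckpointManifest(manifest_path)
ran = []
for op in ["load", "join", "aggregate"]:
    if not manifest.completed(op):
        ran.append(op)  # ...run the costly Spark stage and checkpoint it here...
        manifest.record(op, os.path.join(workdir, op + ".txt"))

# A restarted driver reloads the manifest and finds nothing left to re-run.
rerun = [op for op in ["load", "join", "aggregate"]
         if not CheckpointManifest(manifest_path).completed(op)]
```

On restart the driver would read the checkpoint files named in the manifest instead of recomputing, which is what "pick up where it left off" amounts to at the granularity of these operations.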