Apache Flink on Kubernetes - Resume job if JobManager crashes


Question

I want to run a Flink job on Kubernetes. With a (persistent) state backend, crashing TaskManagers seem to be no issue, since (if I understand correctly) they can ask the JobManager which checkpoint they need to recover from.

A crashing JobManager seems to be a bit more difficult. On the FLIP-6 page I read that ZooKeeper is needed both for leader election and for knowing which checkpoint the JobManager should use to recover.

Seeing as Kubernetes will restart the JobManager whenever it crashes, is there a way for the new JobManager to resume the job without having to set up a ZooKeeper cluster?

The solution we are currently looking at is: when Kubernetes wants to kill the JobManager (for example, because it wants to move it to another VM), it would first create a savepoint. However, this would only work for graceful shutdowns.

The thread at http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-HA-with-Kubernetes-without-Zookeeper-td15033.html seems interesting, but has no follow-up.

Answer

Out of the box, Flink requires a ZooKeeper cluster to recover from JobManager crashes. However, I think lightweight implementations of HighAvailabilityServices, CompletedCheckpointStore, CheckpointIDCounter and SubmittedJobGraphStore can bring you quite far.
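To make the idea concrete, here is a minimal sketch of what one of those lightweight pieces could look like: a checkpoint ID counter that persists its value to a file on a persistent volume, so that a restarted JobManager can continue from where the crashed one left off. This is a standalone illustration only; the class name is hypothetical and it is not wired to Flink's actual CheckpointIDCounter interface, whose exact method signatures depend on the Flink version you run.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/**
 * Hypothetical sketch: a checkpoint ID counter backed by a file on a
 * persistent volume. Mirrors the spirit of Flink's CheckpointIDCounter
 * but does not implement the real interface.
 */
public class FileBackedCheckpointIdCounter {

    private final Path counterFile;

    public FileBackedCheckpointIdCounter(Path counterFile) {
        this.counterFile = counterFile;
    }

    /** Reads the last persisted counter value, or 1 if nothing was stored yet. */
    public synchronized long get() throws IOException {
        if (!Files.exists(counterFile)) {
            return 1L;
        }
        String content =
                new String(Files.readAllBytes(counterFile), StandardCharsets.UTF_8).trim();
        return Long.parseLong(content);
    }

    /** Returns the current value and persists the incremented one. */
    public synchronized long getAndIncrement() throws IOException {
        long current = get();
        setCount(current + 1);
        return current;
    }

    /** Writes the new value via a temp file and a move, to avoid torn writes on crash. */
    public synchronized void setCount(long newId) throws IOException {
        Path tmp = counterFile.resolveSibling(counterFile.getFileName() + ".tmp");
        Files.write(tmp, Long.toString(newId).getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, counterFile,
                StandardCopyOption.REPLACE_EXISTING, StandardCopyOption.ATOMIC_MOVE);
    }
}
```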

Given that you have only one JobManager running at all times (not entirely sure whether K8s can guarantee this) and a persistent storage location, you could implement a CompletedCheckpointStore which retrieves the completed checkpoints from the persistent storage system (e.g. by reading all stored checkpoint files). Additionally, you would keep a file containing the current checkpoint ID counter for the CheckpointIDCounter, and the submitted job graphs for the SubmittedJobGraphStore. So the basic idea is to store everything on a persistent volume which is accessible by the single JobManager.
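The recovery side of that idea could look roughly like the sketch below: on start-up, the new JobManager scans the persistent volume for completed-checkpoint metadata files and picks the latest one. The "checkpoint-<id>" file naming and the class name are assumptions made for this example, not something Flink prescribes; a real CompletedCheckpointStore implementation would also have to deserialize the metadata and register the checkpoints with Flink.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

/**
 * Hypothetical sketch of checkpoint recovery from a persistent volume:
 * find the completed-checkpoint metadata file with the highest ID.
 */
public class CheckpointRecovery {

    private static final String PREFIX = "checkpoint-";

    /** Returns the metadata file of the latest completed checkpoint, if any exists. */
    public static Optional<Path> latestCompletedCheckpoint(Path checkpointDir)
            throws IOException {
        long bestId = -1L;
        Path best = null;
        try (DirectoryStream<Path> stream =
                     Files.newDirectoryStream(checkpointDir, PREFIX + "*")) {
            for (Path candidate : stream) {
                String name = candidate.getFileName().toString();
                try {
                    long id = Long.parseLong(name.substring(PREFIX.length()));
                    if (id > bestId) {
                        bestId = id;
                        best = candidate;
                    }
                } catch (NumberFormatException ignored) {
                    // Skip files that do not follow the assumed naming scheme.
                }
            }
        }
        return Optional.ofNullable(best);
    }
}
```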

