Apache Flink on Kubernetes: resume the job if the JobManager crashes

Question

I want to run a Flink job on Kubernetes using a persistent state backend. Crashing TaskManagers do not seem to be a problem: if I understand correctly, they can ask the JobManager which checkpoint they need to recover from.

A crashing JobManager seems to be more difficult. On the FLIP-6 page I read that ZooKeeper is needed for two things: so that the JobManager knows which checkpoint to recover from, and for leader election.

Since Kubernetes restarts the JobManager whenever it crashes, is there a way for the new JobManager to resume the job without having to set up a ZooKeeper cluster?

The solution we are currently looking at is to create a savepoint whenever Kubernetes wants to kill the JobManager (for example, because it wants to move it to another VM). However, this only works for graceful shutdowns, not for crashes.

The mailing-list thread at http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-HA-with-Kubernetes-without-Zookeeper-td15033.html looks relevant but has no follow-up.

Recommended answer

Out of the box, Flink requires a ZooKeeper cluster to recover from JobManager crashes. However, I think you could write lightweight implementations of HighAvailabilityServices, CompletedCheckpointStore, CheckpointIDCounter and SubmittedJobGraphStore, which could take you quite far.

This assumes that only one JobManager is running at any time (I am not entirely sure whether K8s can guarantee this) and that you have a persistent storage location. Under those assumptions you could implement a CompletedCheckpointStore that retrieves the completed checkpoints from the persistent storage system (e.g. by reading all stored checkpoint files). Additionally, you would keep a file containing the current checkpoint ID counter for the CheckpointIDCounter, and the submitted job graphs for the SubmittedJobGraphStore. So the basic idea is to store everything on a persistent volume that the single JobManager can access.
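
To make the "counter in a file" part of this idea more concrete, here is a minimal sketch in plain Java. It is not Flink's API: the class name FileCheckpointIdCounter, the file name and the starting value of 1 are all made up for illustration, and the class deliberately does not implement Flink's CheckpointIDCounter interface, whose exact methods vary between Flink versions. It also relies on the single-JobManager assumption above, since nothing here protects against two processes writing the file at once.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/**
 * Hypothetical file-backed checkpoint ID counter (illustration only, not a Flink class).
 * The next ID to hand out is kept in a small text file on a persistent volume,
 * so a restarted process continues where the previous one stopped.
 */
final class FileCheckpointIdCounter {
    private final Path counterFile;
    private long next;

    FileCheckpointIdCounter(Path counterFile) throws IOException {
        this.counterFile = counterFile;
        if (Files.exists(counterFile)) {
            // Resume from the value persisted by a previous process.
            String text = new String(Files.readAllBytes(counterFile), StandardCharsets.UTF_8);
            this.next = Long.parseLong(text.trim());
        } else {
            // Assumption: start counting at 1 when no state exists yet.
            this.next = 1L;
        }
    }

    /** Returns the current ID and durably records the following one before returning. */
    synchronized long getAndIncrement() throws IOException {
        long current = next;
        persist(current + 1);
        next = current + 1;
        return current;
    }

    private void persist(long value) throws IOException {
        // Write to a temporary sibling file, then rename it over the real file,
        // so a crash mid-write does not leave a truncated counter behind.
        // Whether an atomic move replaces an existing target is platform-specific;
        // on typical POSIX file systems it does.
        Path tmp = counterFile.resolveSibling(counterFile.getFileName() + ".tmp");
        Files.write(tmp, Long.toString(value).getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, counterFile, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

A second instance created on the same path after a simulated restart continues from the persisted value instead of starting over, which is the property the JobManager needs after Kubernetes restarts it. A real implementation would also have to consider how the underlying volume handles durability and concurrent mounts.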
