Should the entire cluster be restarted if a single Task Manager crashes?

Question

We're running a standalone Flink cluster with 2 Job Managers and 3 Task Managers. Whenever a TM crashes, we simply restart that particular TM and proceed with the processing.

But reading the comments on this question makes it look like we need to restart all 5 nodes that form the cluster to deal with the failure of a single TM. Am I reading this right? What would be the consequences if we restart just the crashed TM and let the healthy ones run as is?

Recommended answer

Sorry if my answer elsewhere was unclear, but what you are doing is fine. Perhaps it would be more accurate to say that the job is being "reset", which happens automatically. Since checkpoints are globally consistent, it's important that all of the taskmanagers rewind and restart processing from the state recorded in the checkpoint, but Flink takes care of this for you (once the necessary resources are again made available).
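
To see why every task has to rewind together, here is a small toy model in plain Python. It is not Flink code and uses no Flink API: `Checkpoint`, `run`, `crash_at`, and `global_rollback` are hypothetical names made up for this sketch. A source task feeds an aggregator task that keeps a running sum, and a checkpoint records the source offset and the sum at the same logical point. When the aggregator crashes, restoring both tasks from the checkpoint gives the correct result, while restoring only the crashed task silently drops the records the source emitted after the checkpoint.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    source_offset: int  # index of the next record the source would emit
    running_sum: int    # aggregator state at that same logical point


def run(records, checkpoint_every, crash_at=None, global_rollback=True):
    """Toy two-task pipeline: a source feeds an aggregator that sums records.

    crash_at is the source offset at which the aggregator crashes (once).
    With global_rollback=True both tasks rewind to the last checkpoint;
    with False only the crashed aggregator is restored.
    """
    offset, total = 0, 0
    checkpoint = Checkpoint(0, 0)
    crashed = False
    while offset < len(records):
        if crash_at is not None and offset == crash_at and not crashed:
            crashed = True
            # The aggregator loses its in-memory state and is restored
            # from the last completed checkpoint.
            total = checkpoint.running_sum
            if global_rollback:
                # The source rewinds too, so the lost records are replayed.
                offset = checkpoint.source_offset
            continue
        total += records[offset]
        offset += 1
        if offset % checkpoint_every == 0:
            # Consistent snapshot: offset and sum are captured together.
            checkpoint = Checkpoint(offset, total)
    return total


records = list(range(1, 11))  # 1..10, true sum is 55

print(run(records, checkpoint_every=4))                                     # 55
print(run(records, checkpoint_every=4, crash_at=6))                         # 55
print(run(records, checkpoint_every=4, crash_at=6, global_rollback=False))  # 44
```

In the last call the aggregator falls back to the state from offset 4 while the source carries on from offset 6, so records 5 and 6 are never counted. Resetting the whole job to the checkpoint is what avoids this kind of inconsistency, and per the answer above Flink does it automatically once the replacement TM is available.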
