Flink Checkpoint Failure - Checkpoints time out after 10 mins

Question

We get one or two checkpoint failures every day while processing data. The data volume is low, under 10k, and our checkpoint interval is set to 2 minutes. (Processing is slow because we need to sink the data to another API endpoint, which takes some time at the end of the Flink job, so the total time is streaming the data plus the sink to the external API endpoint.)

The root issue: checkpoints time out after 10 minutes. This happens because data processing takes longer than 10 minutes, so the checkpoint times out. We could increase the parallelism to speed up processing, but if the data grows we would have to increase the parallelism again, so we don't want to go that way.

Suggested solution: I saw someone suggest setting a pause between the old and new checkpoints, but my question is: if I set a pause time, will the new checkpoint miss the state from during the pause?

Aim: How can we avoid this issue and record the correct state without missing any data?

Failed checkpoint: [image not included]

Completed checkpoint: [image not included]

Subtask not responding: [image not included]

Thanks

Recommended answer

There are several related configuration settings you can adjust, such as the checkpoint interval, the pause between checkpoints, and the number of concurrent checkpoints. No combination of these settings will result in data being skipped for checkpointing.

Setting a minimum pause between checkpoints means that Flink won't initiate a new checkpoint until some time has passed since the completion (or failure) of the previous checkpoint, but this has no effect on the timeout.

Sounds like you should extend the timeout, which you can do like this:

env.getCheckpointConfig().setCheckpointTimeout(n);

where n is measured in milliseconds. See the section of the Flink docs on enabling and configuring checkpointing for more details.
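
For reference, here is a minimal sketch of how the settings mentioned above might be combined in a DataStream job. The 2-minute interval comes from the question; the 30-minute timeout, the 1-minute pause, and the class name `CheckpointSettingsSketch` are illustrative assumptions, not recommendations, so pick values that match how long your sink actually takes. The 10-minute limit in the question matches Flink's default checkpoint timeout.

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical class name; only the checkpoint configuration is shown.
public class CheckpointSettingsSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint interval: 2 minutes, as described in the question.
        env.enableCheckpointing(2 * 60 * 1000L);

        CheckpointConfig config = env.getCheckpointConfig();

        // Timeout in milliseconds. 30 minutes is an assumed value,
        // chosen only to be longer than the default of 10 minutes.
        config.setCheckpointTimeout(30 * 60 * 1000L);

        // Minimum pause after a checkpoint completes (or fails) before
        // the next one may start. 1 minute is an assumed value.
        config.setMinPauseBetweenCheckpoints(60 * 1000L);

        // Allow only one checkpoint in flight at a time.
        config.setMaxConcurrentCheckpoints(1);

        // Sources, transformations, the sink, and env.execute(...) would
        // follow here; they are omitted because the question does not show them.
    }
}
```

Raising the timeout only stops slow checkpoints from being aborted; if the sink is what keeps subtasks from responding, the checkpoints will still take as long as the sink does.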
