Flink Checkpoint Failure - Checkpoints time out after 10 mins

Problem description

We get one or two checkpoint failures every day while processing data. The data volume is low, under 10k, and our checkpoint interval is set to 2 minutes. (Processing is very slow because we need to sink the data to another API endpoint, which takes some time to process at the end of the Flink job, so the total time is streaming the data plus sinking it to the external API endpoint.)

The root issue is that checkpoints time out after 10 minutes: processing the data takes longer than 10 minutes, so the checkpoint times out. We could increase the parallelism to speed up processing, but if the data grows we would have to increase the parallelism again, so we don't want to go that way.

Suggested solution: I saw someone suggest setting a pause between the old and new checkpoints, but my question is: if I set a pause time, will the new checkpoint miss the state produced during the pause?

Aim: How can we avoid this issue and record the correct state without missing any data?

Checkpoint failure: [image not included]

Completed checkpoint: [image not included]

Subtask not responding: [image not included]

Thanks

Recommended answer

There are several related configuration settings you can adjust, such as the checkpoint interval, the pause between checkpoints, and the number of concurrent checkpoints. No combination of these settings will result in data being skipped for checkpointing.

Setting a pause between checkpoints means that Flink won't initiate a new checkpoint until some time has passed since the completion (or failure) of the previous checkpoint, but this has no effect on the timeout.
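To make the timing concrete, here is a small plain-Java sketch of a simplified model of when the next checkpoint could start. It is not Flink's scheduler code: the class name, the method name, and the formula (the later of "previous start + interval" and "previous end + pause") are assumptions made for illustration. The 2-minute interval comes from the question; the 1-minute pause is an illustrative value.

```java
public class CheckpointPauseModel {

    /**
     * Simplified model (an assumption, not Flink internals): the next checkpoint
     * starts no earlier than one interval after the previous start, and no earlier
     * than the minimum pause after the previous checkpoint completed or failed.
     * All values are in milliseconds.
     */
    public static long nextTriggerMillis(long prevStart, long prevEnd,
                                         long interval, long minPause) {
        return Math.max(prevStart + interval, prevEnd + minPause);
    }

    public static void main(String[] args) {
        long interval = 120_000L; // 2 minutes, as in the question
        long minPause = 60_000L;  // illustrative value

        // Previous checkpoint started at t=0 and finished after 30 seconds:
        // the interval is the binding constraint.
        System.out.println(nextTriggerMillis(0L, 30_000L, interval, minPause));  // 120000

        // Previous checkpoint started at t=0 and took 9 minutes:
        // the pause is the binding constraint.
        System.out.println(nextTriggerMillis(0L, 540_000L, interval, minPause)); // 600000
    }
}
```

In this model the pause only moves the start of the next checkpoint later. It does not change which state the checkpoint covers, and it does not change how long a checkpoint may run before it times out.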

It sounds like you should extend the timeout, which you can do like this:

env.getCheckpointConfig().setCheckpointTimeout(n);

where n is measured in milliseconds. See the section of the Flink docs on enabling and configuring checkpointing for more details.
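For context, a fuller configuration might look like the sketch below. It assumes the Flink 1.x DataStream API on the classpath. The class name is hypothetical, and the 20-minute timeout and 1-minute pause are illustrative values rather than recommendations; only the 2-minute interval comes from the question.

```java
import java.time.Duration;

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTimeoutExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint interval: 2 minutes, as in the question.
        env.enableCheckpointing(Duration.ofMinutes(2).toMillis());

        CheckpointConfig config = env.getCheckpointConfig();

        // Extend the timeout beyond the 10 minutes seen in the question.
        // 20 minutes is an assumed value; pick one that covers the slow sink.
        config.setCheckpointTimeout(Duration.ofMinutes(20).toMillis());

        // Wait at least 1 minute (assumed value) after a checkpoint completes
        // or fails before starting the next one.
        config.setMinPauseBetweenCheckpoints(Duration.ofMinutes(1).toMillis());

        // Allow only one checkpoint in flight at a time.
        config.setMaxConcurrentCheckpoints(1);

        System.out.println("Checkpoint timeout (ms): " + config.getCheckpointTimeout());

        // Define sources and sinks here, then call env.execute(...) to run the job.
    }
}
```

All durations are passed in milliseconds, matching the unit of n above.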
