Understanding flink savepoints & checkpoints
Question
Consider an Apache Flink streaming application with a pipeline like this:
Kafka-Source -> flatMap 1 -> flatMap 2 -> flatMap 3 -> Kafka-Sink
where every flatMap function is a stateless operator (e.g. the normal .flatMap function of a DataStream).
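To make the question concrete, the pipeline can be modeled as a chain of pure functions. This is a minimal plain-Python sketch (not the Flink API; function names and the example transformations are hypothetical) illustrating that each flatMap step depends only on its input record, so there is no operator state for a checkpoint to capture:

```python
# Model of the pipeline: three stateless transformations chained between
# a source and a sink. Each step depends only on the current record.

def flat_map_1(record):
    # stateless: output depends only on the input record
    return [record + 1]

def flat_map_2(record):
    return [record * 2]

def flat_map_3(record):
    return [record - 3]

def run_pipeline(source_records):
    sink = []
    for r in source_records:
        for a in flat_map_1(r):
            for b in flat_map_2(a):
                for c in flat_map_3(b):
                    sink.append(c)
    return sink

print(run_pipeline([1, 2, 3]))  # -> [1, 3, 5]
```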
How do checkpoints/savepoints work in case an incoming message is pending at flatMap 3? Will the message be reprocessed after a restart, beginning from flatMap 1, or will it skip ahead to flatMap 3?
I am a bit confused, because the documentation seems to refer to application state as something used by stateful operators, but I don't have stateful operators in my application. Is the "processing progress" saved and restored at all, or will the whole pipeline be re-processed after a failure/restart?
And is there a difference between a failure (-> Flink restores from a checkpoint) and a manual restart using savepoints, regarding the questions above?
I tried to find out myself (with checkpointing enabled, using EXACTLY_ONCE and the RocksDB backend) by placing a Thread.sleep() in flatMap 3 and then cancelling the job with a savepoint. However, this led to the flink command-line tool hanging until the sleep was over, and even then flatMap 3 was executed and its output sent to the sink before the job got cancelled. So it seems I cannot manually force this situation to analyze Flink's behaviour.
In case the "processing progress" is not saved/covered by checkpoints/savepoints as described above, how could I make sure, for every message reaching my pipeline, that any given operator (flatMap 1/2/3) is never re-processed in a restart/failure situation?
Answer
When a checkpoint is taken, every task (a parallel instance of an operator) checkpoints its state. In your example, the three flatMap operators are stateless, so there is no state to be checkpointed. The Kafka source is stateful and checkpoints the read offsets for all partitions.
In case of a failure, the job is recovered and all tasks load their state; for the source operator this means the read offsets are reset. Hence, the application reprocesses all events since the last checkpoint.
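This recovery behaviour can be sketched in a few lines of plain Python (not Flink code; all names and the checkpoint interval are illustrative). The checkpoint stores only the source offset, because the stateless flatMaps contribute no state; on failure, the offset is reset to the last checkpoint and every later event is replayed through the whole chain, starting again at flatMap 1:

```python
# Simulation of Flink's recovery semantics for a stateless pipeline:
# the checkpoint is just the source offset, and recovery replays all
# events after that offset from the top of the operator chain.

def flat_map_chain(record):
    # stands in for flatMap 1 -> flatMap 2 -> flatMap 3; stateless
    return record * 2

def run(events, checkpoint_every=2, crash_at=None):
    sink = []
    checkpointed_offset = 0
    offset = 0
    while offset < len(events):
        if crash_at is not None and offset == crash_at:
            # failure: restore from the last checkpoint; replayed events
            # go through the entire chain again, they do not "skip ahead"
            offset = checkpointed_offset
            crash_at = None
            continue
        sink.append(flat_map_chain(events[offset]))
        offset += 1
        if offset % checkpoint_every == 0:
            checkpointed_offset = offset  # checkpoint: source offset only
    return sink

# A crash at offset 3 replays offset 2 (already processed) and offset 3,
# so the sink sees a duplicate for offset 2:
print(run([1, 2, 3, 4], crash_at=3))  # -> [2, 4, 6, 6, 8]
```

This is why a naive sink observes duplicates after recovery, even though no operator in the pipeline keeps state of its own.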
In order to achieve end-to-end exactly-once, you need a special sink connector that either offers transaction support (e.g., for Kafka) or supports idempotent writes.
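The idempotent-write option can be illustrated with a small plain-Python sketch (class and keying scheme are hypothetical, not a Flink API): if each record is written under a deterministic key, a replay after recovery overwrites the previous write instead of producing a duplicate:

```python
# Sketch of an idempotent sink: writes are keyed by a deterministic id
# (here the source offset), so replayed records overwrite rather than
# duplicate, making reprocessing after recovery harmless.

class IdempotentSink:
    def __init__(self):
        self.store = {}  # key -> value

    def write(self, key, value):
        # a second write with the same key has no additional effect
        self.store[key] = value

sink = IdempotentSink()
# offsets 0 and 1 are delivered twice, simulating a replay after failure
for offset, value in [(0, "a"), (1, "b"), (0, "a"), (1, "b")]:
    sink.write(offset, value)

print(len(sink.store))  # -> 2, the duplicates collapsed
```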