Flink一次准确的消息处理 [英] Flink exactly-once message processing

查看:214
本文介绍了Flink一次准确的消息处理的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我已经设置了一个Flink 1.2独立集群,其中包含2个JobManager和3个TaskManager,并且我正在使用JMeter通过产生Kafka消息/事件进行负载测试,然后对其进行处理.处理作业在TaskManager上运行,通常每秒需要约15K个事件.
作业已设置EXACTLY_ONCE检查点,并且正在将状态和检查点持久存储到Amazon S3. 如果我关闭运行该作业的TaskManager会花费一些时间(几秒钟),那么该作业将在其他TaskManager上恢复.作业主要记录事件ID,这些事件ID是连续的整数(例如,从0到1200000).
当我在TaskManager上检查输出时,我关闭的最后一个计数是例如500000,然后在另一个TaskManager上检查已恢复作业的输出时,它的开始值为〜400000.这意味着〜100K重复事件.此数字取决于测试速度可以更高还是更低.
不知道我是否缺少某些内容,但是我希望该作业在不同的TaskManager上恢复后显示下一个连续的数字(如500001). 有谁知道为什么会这样/我必须配置额外的设置才能获得准确的一次?

I've setup a Flink 1.2 standalone cluster with 2 JobManagers and 3 TaskManagers and I'm using JMeter to load-test it by producing Kafka messages / events which are then processed. The processing job runs on a TaskManager and it usually takes ~15K events/s.
The job has set EXACTLY_ONCE checkpointing and is persisting state and checkpoints to Amazon S3. If I shutdown the TaskManager running the job it takes a bit, a few seconds, then the job is resumed on a different TaskManager. The job mainly logs the event ids which are consecutive integers (e.g. from 0 to 1200000).
When I check the output on the TaskManager I shut down the last count is for example 500000, then when I check the output on the resumed job on a different TaskManager it starts with ~ 400000. This means ~100K of duplicated events. This number is dependent on the speed of the test can be higher or lower.
Not sure if I'm missing something but I would expect the job to display the next consecutive number (like 500001) after resuming on the different TaskManager.
Does anyone know why this is happening / extra settings I have to configure to obtain the exactly once?

推荐答案

您看到的只是一次的预期行为. Flink通过在故障情况下将检查点和重播结合在一起来实现容错.不能保证每个事件仅一次发送到管道中,而不能保证每个事件一次都影响管道的状态.

You are seeing the expected behavior for exactly-once. Flink implements fault-tolerance via a combination of checkpointing and replay in the case of failures. The guarantee is not that each event will be sent into the pipeline exactly once, but rather that each event will affect your pipeline's state exactly once.

检查点会在整个群集中创建一致的快照.在恢复期间,将还原操作员状态,并从最近的检查点重播源.

Checkpointing creates a consistent snapshot across the entire cluster. During recovery, operator state is restored and the sources are replayed from the most recent checkpoint.

有关更全面的说明,请参阅以下Artisans博客数据:

For a more thorough explanation, see this data Artisans blog post: High-throughput, low-latency, and exactly-once stream processing with Apache Flink™, or the Flink docs.

这篇关于Flink一次准确的消息处理的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆