Flink 如何处理 IterativeStream 中的检查点和状态? [英] How does Flink treat checkpoints and state within IterativeStream?

查看:35
本文介绍了Flink 如何处理 IterativeStream 中的检查点和状态?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我可以在文档:

Flink 目前只为没有处理的作业提供处理保证迭代.在迭代作业上启用检查点会导致例外.为了在迭代程序上强制检查点用户在启用检查点时需要设置一个特殊标志:env.enableCheckpointing(interval, force = true).

Flink currently only provides processing guarantees for jobs without iterations. Enabling checkpointing on an iterative job causes an exception. In order to force checkpointing on an iterative program the user needs to set a special flag when enabling checkpointing: env.enableCheckpointing(interval, force = true).

请注意循环边缘的飞行记录(和状态与它们相关的更改)将在失败期间丢失.

Please note that records in flight in the loop edges (and the state changes associated with them) will be lost during failure.

这是指批处理作业中的迭代或迭代流,还是两者兼而有之?

Does this refer to iterations within batch jobs or to iterative streams, or both?

如果它指的是迭代流,那么在失败的情况下,以下运算符的哪种状态可用?(例如来自本次对话,关于使用 ConnectedIterativeStreams 在操作符之间共享状态并使用 .closeWith(stream.broadcast()) 终止迭代).

If it refers to iterative streams, what state of the following operators would be available in the event of failure? (Example taken from this conversation about sharing state across operators using a ConnectedIterativeStreams and terminating the iteration with .closeWith(stream.broadcast())).

DataStream<Point> input = ...
ConnectedIterativeStreams<Point, Centroids> inputsAndCentroids = input.iterate().withFeedbackType(Centroids.class)
DataStream<Centroids> updatedCentroids = inputsAndCentroids.flatMap(new MyCoFlatmap())
inputsAndCentroids.closeWith(updatedCentroids.broadcast())

class MyCoFlatmap implements CoFlatMapFunction<Point, Centroid, Centroid>{...}

如果 MyCoFlatmap 是一个 CoProcessFunction 而不是 CoFlatMapFunction(意味着它也可以保持状态),会有什么变化吗?

Would there be any change if MyCoFlatmap were to be a CoProcessFunction instead of a CoFlatMapFunction (meaning it could hold state too)?

推荐答案

该限制仅适用于 Flink 的 DataStream/Streaming API 在使用迭代时.使用 DataSet/Batch API 时,没有任何限制.

The limitation only applies to Flink's DataStream/Streaming API when using iterations. When using the DataSet/Batch API, there are no limitations.

当使用流迭代时,您实际上不会丢失操作符状态,但您可能会丢失已从操作符通过循环边缘发送回迭代头的记录.在您的示例中,如果发生故障,从 updatedCentroids 发送到 inputsAndCentroids 的记录可能会丢失.因此,在这种情况下,Flink 无法保证恰好一次处理保证.

When using streaming iterations you actually don't lose operator state but you might lose records which have been sent from an operator back to the iteration head via the loop edge. In your example, records sent from the updatedCentroids to the inputsAndCentroids might be lost in case of a failure. Hence, Flink cannot guarantee exactly once processing guarantees in this case.

其实有一个Flink 改进建议 解决了这个缺点.然而,它还没有完成.

There is actually a Flink improvement proposal which addresses this shortcoming. However, it has not been finished yet.

这篇关于Flink 如何处理 IterativeStream 中的检查点和状态?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆