Flink 的广播状态如何初始化? [英] How could Flink broadcast state be initialized?

查看:213
本文介绍了Flink 的广播状态如何初始化?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我们正在尝试构建一个用例,其中来自流的数据通过计算公式运行,但公式本身也应该(很少)可更新.通过阅读文档,在我看来 Flink 广播状态很适合这种情况.

We're trying to build a use case where data from a stream is run through a calculation formula, but the formula itself should also (rarely) be updateable. From reading the documentation, it seems to me that Flink broadcast state would be a natural fit for a case like this.

作为一个实验,我构建了一个简化版本:假设我有一个整数流,第二个流包含这些整数的乘法因子(我可以随意发送值).第二个流的频率非常低,很容易在事件之间以几天或几周的顺序出现.目前这两个都是作为简单的套接字服务器实现的,最终产品将使用 Kafka.

As an experiment, I've built a simplified version: suppose I have a stream of integers, and a second stream containing multiplication factors for those integers (where I can send values at will). The second stream is very low frequency, could easily be in the order of days or weeks between events. For now these both are implemented as simple socket servers, the end product would use Kafka.

在我的示例应用程序中,这一切都有效,但我遇到了一个问题:当系统启动并且广播流上还没有发生任何事情时会发生什么?我可以从哪里获得默认(或上次使用)因子?在我的示例中,我暂时通过硬编码一个值来解决它,但这不是我可以使用的.

In my example application this all works, but I'm left with one problem: what happens when the system starts and nothing has happened on the broadcasted stream yet? Where could I get the default (or last used) factor from? In my example I've solved it by hard coding a value for now, but that's not something I could use.

在我的实验项目中,我对此感到有些困惑,因为 {processElement} 仅获得只读广播状态,但在有可能需要的更新之前不会调用 processBroadcastElement很长时间.我的计划是将使用的公式存储在数据库中,并在工作(重新)开始时以某种方式读取它,但我还没有找到一种方法来完成这项工作.欢迎更多知识渊博的人提出任何建议,这是我的第一个 Flink 项目,所以我正在努力寻找解决方法.

In my experimental project I'm a bit stumped by this, as {processElement} only gets a read-only broadcast state, but processBroadcastElement won't be called until there's an update which could take a long time. My plan was to store the formulae used in a database and somehow read it in when the job (re)starts but I haven't found a way to make this work. Any suggestions from more knowledgeable people would be welcome, this is my first Flink project so I'm trying to find my way around.

工作示例在这里:https://github.com/tonvanbart/flink-broadcast-example/树/地图状态尝试Flink 代码在 BroadcastState 类中.

The working example is here: https://github.com/tonvanbart/flink-broadcast-example/tree/mapstate-attempt The Flink code is in class BroadcastState.

提前致谢.

推荐答案

如果系统从检查点/保存点重新启动,那么您拥有广播的最后一个因素(通过状态),对吗?所以我假设问题是它最初启动时要做什么.

If the system is restarting from a checkpoint/savepoint, then you have the last factor that was broadcast (via state), right? So I assume the issue is what to do when it initially starts up.

如果是这样,那么这是您使用的模式的一个常见问题,您实际上希望阻止整数流,直到您从广播流中获得初始值.

If so, then this is a common problem with the pattern you're using, where you effectively want to block the stream of integers until you've gotten the initial value from the broadcast stream.

现在常见的解决方案是在您的操作符中缓冲整数流(使用状态)直到您获得初始值,但这可能会导致无界状态,具体取决于整数进入的速度以及您需要多长时间等等.

Right now the common solution is to buffer the integer stream in your operator (using state) until you get that initial value, but this can result in unbounded state depending on how fast integers are coming in, and how long you have to wait.

您可以尝试的其他方法是包装您的整数源(使其成为委托)并且在您知道某些内容已被广播之前不要发出任何值.例如.将广播到可查询状态,并定期检查直到状态存在.

Something else you can try is to wrap your integer source (make it a delegate) and don't emit any values until you know that something has been broadcast. E.g. make what's broadcast into queryable state, and do a periodic check until the state exists.

这篇关于Flink 的广播状态如何初始化?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆