Flink广播状态如何初始化? [英] How could Flink broadcast state be initialized?

查看:53
本文介绍了Flink广播状态如何初始化?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我们正在尝试构建一个用例,其中流中的数据通过计算公式运行,但是公式本身也应该(很少)是可更新的.通过阅读文档,在我看来Flink广播状态自然适合这种情况.

We're trying to build a use case where data from a stream is run through a calculation formula, but the formula itself should also (rarely) be updateable. From reading the documentation, it seems to me that Flink broadcast state would be a natural fit for a case like this.

作为一个实验,我构建了一个简化的版本:假设我有一个整数流,第二个流包含这些整数的乘法因子(我可以随意发送值).第二个流的频率非常低,很容易在事件之间间隔几天或几周的时间.现在,这两个都被实现为简单的套接字服务器,最终产品将使用Kafka.

As an experiment, I've built a simplified version: suppose I have a stream of integers, and a second stream containing multiplication factors for those integers (where I can send values at will). The second stream is very low frequency, could easily be in the order of days or weeks between events. For now these both are implemented as simple socket servers, the end product would use Kafka.

在我的示例应用程序中,这一切都可行,但是我有一个问题:当系统启动并且广播流还没有发生任何事情时会发生什么?从哪里可以得到默认(或上次使用)的因子?在我的示例中,我已经通过对值进行硬编码解决了这一问题,但这不是我可以使用的东西.

In my example application this all works, but I'm left with one problem: what happens when the system starts and nothing has happened on the broadcasted stream yet? Where could I get the default (or last used) factor from? In my example I've solved it by hard coding a value for now, but that's not something I could use.

在我的实验项目中,我对此感到有些困惑,因为{processElement}仅获得只读的广播状态,但是在进行更新之前,将不会调用 processBroadcastElement 很长时间.我的计划是将使用的公式存储在数据库中,并在作业(重新)开始时以某种方式读取它,但是我还没有找到一种使这项工作有效的方法.来自知识渊博的人们的任何建议都将受到欢迎,这是我的第一个Flink项目,因此我试图找到解决方法.

In my experimental project I'm a bit stumped by this, as {processElement} only gets a read-only broadcast state, but processBroadcastElement won't be called until there's an update which could take a long time. My plan was to store the formulae used in a database and somehow read it in when the job (re)starts but I haven't found a way to make this work. Any suggestions from more knowledgeable people would be welcome, this is my first Flink project so I'm trying to find my way around.

工作示例在这里: https://github.com/tonvanbart/flink-broadcast-example/树/mapstate尝试Flink代码在 BroadcastState 类中.

The working example is here: https://github.com/tonvanbart/flink-broadcast-example/tree/mapstate-attempt The Flink code is in class BroadcastState.

谢谢.

推荐答案

如果系统正在从检查点/保存点重新启动,那么您拥有(通过状态)广播的最后一个因素,对吗?因此,我认为问题在于最初启动该怎么办.

If the system is restarting from a checkpoint/savepoint, then you have the last factor that was broadcast (via state), right? So I assume the issue is what to do when it initially starts up.

如果是这样,那么这就是您使用的模式的一个常见问题,您实际上要阻止整数流,直到从广播流中获得初始值为止.

If so, then this is a common problem with the pattern you're using, where you effectively want to block the stream of integers until you've gotten the initial value from the broadcast stream.

现在,常见的解决方案是将整数流缓冲在操作符中(使用状态),直到获得该初始值为止,但这可能会导致无界状态,具体取决于输入整数的速度以及需要多长时间.等待.

Right now the common solution is to buffer the integer stream in your operator (using state) until you get that initial value, but this can result in unbounded state depending on how fast integers are coming in, and how long you have to wait.

您可以尝试的其他方法是包装您的整数源(使其成为委托),并且在您知道已广播某些内容之前不发出任何值.例如.将广播的内容设置为可查询状态,并进行定期检查,直到该状态存在为止.

Something else you can try is to wrap your integer source (make it a delegate) and don't emit any values until you know that something has been broadcast. E.g. make what's broadcast into queryable state, and do a periodic check until the state exists.

这篇关于Flink广播状态如何初始化?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆