Apache Beam管道中的连续状态 [英] Continuous state in Apache Beam pipeline
问题描述
我正在为数据流运行器开发光束管道.我的用例需要以下功能.
I'm developing a beam pipeline for dataflow runner. I need the below functionality in my use case.
- 从Kafka主题读取输入事件.每个Kafka消息值派生
[userID, Event]
对. - 对于每个
userID
,我需要维护一个profile
,并根据当前的Event
,可能对profile
进行更新.如果profile
已更新:- 已更新的
profile
被写入输出流. - 管道中该
userID
的下一个Event
应该引用更新后的配置文件.
- 已更新的
- Read input events from Kafka topic(s). Each Kafka message value derives
[userID, Event]
pair. - For each
userID
, I need to maintain aprofile
and based on the currentEvent
, a possible update to theprofile
is possible. If theprofile
is updated:- Updated
profile
is written to output stream. - The next
Event
for thatuserID
in the pipeline should refer to the updated profile.
- Updated
我当时正在考虑在Beam中使用提供的状态功能,而不依赖于用于维护用户个人资料的外部键值存储.当前版本的Beam(2.1.0
)和dataflow
流道是否可行?如果我正确理解,状态将范围限定在单个窗口触发中的元素(即即使对于GlobalWindow
,状态也将范围限定在由触发器引起的窗口单个触发中的元素).我在这里想念什么吗?
I was thinking of using the provided state functionality in Beam, without depending on an external key-value store for maintaining the user profile. Is this feasible with the current version of beam (2.1.0
) and dataflow
runner? If I understand correctly the state is scoped to the elements in a single window firing (i.e even for a GlobalWindow
, the state will be scoped to the elements in a single firing of the window caused by a trigger). Am I missing something here?
推荐答案
状态将完全适合您的用例.
State would be perfectly appropriate for your use case.
唯一的修正是状态范围仅限于单个窗口,但触发器触发不会影响它.因此,如果您的状态很小,则可以将其存储在全局窗口中.当新元素到达时,您可以使用状态,根据需要输出元素,并更改状态.
The only correction is that state is scoped to a single window, but trigger firings do not affect it. So, if your state is small you can store it in a global window. When a new element arrives, you can use use the state, output elements as needed, and make changes to the state.
唯一要考虑的是,如果您拥有无限数量的用户ID,则状态可能会变大.例如,您可能希望不活动的计时器在一段时间后清除旧用户状态.
The only thing to consider would be if you have an unbounded number of user IDs, how big the state may become. For instance, you may want an inactivity timer to clear old user state after some period of time.
如果您还没有阅读它们,那么博客文章使用Apache Beam进行状态处理和及时(有状态) )使用Apache Beam进行处理很好地介绍了这些概念和API.
If you haven't read them, the blog posts Stateful Processing with Apache Beam and Timely (and Stateful) Processing with Apache Beam provide a good introduction to these concepts and APIs.
这篇关于Apache Beam管道中的连续状态的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!