Apache Beam 管道中的连续状态 [英] Continuous state in Apache Beam pipeline

查看:21
本文介绍了Apache Beam 管道中的连续状态的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在为数据流运行器开发光束管道.我的用例需要以下功能.

I'm developing a beam pipeline for dataflow runner. I need the below functionality in my use case.

  1. 从 Kafka 主题读取输入事件.每个 Kafka 消息值派生 [userID, Event] 对.
  2. 对于每个userID,我需要维护一个profile,并根据当前的Event,对profile 进行可能的更新 是可能的.如果 profile 更新:
    • 更新的 profile 被写入输出流.
    • 管道中该 userID 的下一个 Event 应引用更新的配置文件.
  1. Read input events from Kafka topic(s). Each Kafka message value derives [userID, Event] pair.
  2. For each userID, I need to maintain a profile and based on the current Event, a possible update to the profile is possible. If the profile is updated:
    • Updated profile is written to output stream.
    • The next Event for that userID in the pipeline should refer to the updated profile.

我想在 Beam 中使用提供的状态功能,而不依赖于外部键值存储来维护用户配置文件.这对于当前版本的 Beam (2.1.0) 和 dataflow runner 是否可行?如果我理解正确,则状态的范围仅限于单个窗口触发中的元素(即,即使对于 GlobalWindow,状态也将仅限于由触发器引起的窗口的单个触发中的元素).我在这里遗漏了什么吗?

I was thinking of using the provided state functionality in Beam, without depending on an external key-value store for maintaining the user profile. Is this feasible with the current version of beam (2.1.0) and dataflow runner? If I understand correctly the state is scoped to the elements in a single window firing (i.e even for a GlobalWindow, the state will be scoped to the elements in a single firing of the window caused by a trigger). Am I missing something here?

推荐答案

State 非常适合您的用例.

State would be perfectly appropriate for your use case.

唯一的修正是状态限定在单个窗口内,但触发器触发不影响它.因此,如果您的状态很小,您可以将其存储在全局窗口中.当新元素到达时,您可以使用状态,根据需要输出元素,并更改状态.

The only correction is that state is scoped to a single window, but trigger firings do not affect it. So, if your state is small you can store it in a global window. When a new element arrives, you can use use the state, output elements as needed, and make changes to the state.

唯一需要考虑的是,如果您有无限数量的用户 ID,状态可能会变得多大.例如,您可能希望不活动计时器在一段时间后清除旧的用户状态.

The only thing to consider would be if you have an unbounded number of user IDs, how big the state may become. For instance, you may want an inactivity timer to clear old user state after some period of time.

如果你还没有阅读它们,博客文章 使用 Apache Beam及时(和有状态) 使用 Apache Beam 进行处理很好地介绍了这些概念和 API.

If you haven't read them, the blog posts Stateful Processing with Apache Beam and Timely (and Stateful) Processing with Apache Beam provide a good introduction to these concepts and APIs.

这篇关于Apache Beam 管道中的连续状态的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆