Apache Beam管道中的连续状态 [英] Continuous state in Apache Beam pipeline

查看:82
本文介绍了Apache Beam管道中的连续状态的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在为数据流运行器开发光束管道.我的用例需要以下功能.

I'm developing a beam pipeline for dataflow runner. I need the below functionality in my use case.

  1. 从Kafka主题读取输入事件.每个Kafka消息值派生[userID, Event]对.
  2. 对于每个userID,我需要维护一个profile,并根据当前的Event,可能对profile进行更新.如果profile已更新:
    • 已更新的profile被写入输出流.
    • 管道中该userID的下一个Event应该引用更新后的配置文件.
  1. Read input events from Kafka topic(s). Each Kafka message value derives [userID, Event] pair.
  2. For each userID, I need to maintain a profile and based on the current Event, a possible update to the profile is possible. If the profile is updated:
    • Updated profile is written to output stream.
    • The next Event for that userID in the pipeline should refer to the updated profile.

我当时正在考虑在Beam中使用提供的状态功能,而不依赖于用于维护用户个人资料的外部键值存储.当前版本的Beam(2.1.0)和dataflow流道是否可行?如果我正确理解,状态将范围限定在单个窗口触发中的元素(即即使对于GlobalWindow,状态也将范围限定在由触发器引起的窗口单个触发中的元素).我在这里想念什么吗?

I was thinking of using the provided state functionality in Beam, without depending on an external key-value store for maintaining the user profile. Is this feasible with the current version of beam (2.1.0) and dataflow runner? If I understand correctly the state is scoped to the elements in a single window firing (i.e even for a GlobalWindow, the state will be scoped to the elements in a single firing of the window caused by a trigger). Am I missing something here?

推荐答案

状态将完全适合您的用例.

State would be perfectly appropriate for your use case.

唯一的修正是状态范围仅限于单个窗口,但触发器触发不会影响它.因此,如果您的状态很小,则可以将其存储在全局窗口中.当新元素到达时,您可以使用状态,根据需要输出元素,并更改状态.

The only correction is that state is scoped to a single window, but trigger firings do not affect it. So, if your state is small you can store it in a global window. When a new element arrives, you can use use the state, output elements as needed, and make changes to the state.

唯一要考虑的是,如果您拥有无限数量的用户ID,则状态可能会变大.例如,您可能希望不活动的计时器在一段时间后清除旧用户状态.

The only thing to consider would be if you have an unbounded number of user IDs, how big the state may become. For instance, you may want an inactivity timer to clear old user state after some period of time.

如果您还没有阅读它们,那么博客文章使用Apache Beam进行状态处理及时(有状态) )使用Apache Beam进行处理很好地介绍了这些概念和API.

If you haven't read them, the blog posts Stateful Processing with Apache Beam and Timely (and Stateful) Processing with Apache Beam provide a good introduction to these concepts and APIs.

这篇关于Apache Beam管道中的连续状态的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆