Beam中的状态处理-跨窗格共享状态吗? [英] Stateful processing in Beam - is state shared across window panes?

查看:137
本文介绍了Beam中的状态处理-跨窗格共享状态吗?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

Apache Beam最近通过StateSpec@StateId注释引入了状态单元,并在Apache Flink和Google Cloud Dataflow中提供了部分支持.

Apache Beam has recently introduced state cells, through StateSpec and the @StateId annotation, with partial support in Apache Flink and Google Cloud Dataflow.

我的问题是关于在窗口流上使用有状态DoFn的情况下的状态垃圾收集.通常,当窗口过期(即水印通过窗口的末端)时,跑步者将状态移除(收集垃圾).但是,请考虑以下情况:早期触发了窗格,而抛弃了触发的窗格:

My question is about state garbage collection, in the case where a stateful DoFn is used on a windowed stream. Typically, state is removed (garbage collected) by the runner when the window expires (i.e. the watermark passes the end of the window). However, consider the case where window panes are triggered early, and the fired panes are discarded:

input.apply(Window.<MyElement>into(CalendarWindows.days(1))
  .triggering(AfterWatermark.pastEndOfWindow()
  .withEarlyFirings(
    AfterProcessingTime.pastFirstElementInPane()
      .plusDelayOf(Duration.standardMinutes(10))
  ))
  .discardingFiredPanes()
  .apply(ParDo.of(new MyStatefulDofn()));

在这种情况下,是否会保留早期触发的键的状态,直到窗口过期?也就是说,同一窗口中的后续窗格是否可以访问早期窗格编写的状态?

In this case, would the state for the keys which were fired early be kept until after the window expires? i.e. would subsequent panes in the same window have access to state written by earlier panes?

推荐答案

您的触发配置不会影响对ParDo的有状态处理的进行方式.这些元素将立即提供给您的DoFn,而无需任何缓冲/触发,并且您的DoFn可以直接控制何时发生输出.

Your triggering configuration does not affect how stateful processing of a ParDo proceeds. The elements are provided immediately to your DoFn without any buffering/triggering and your DoFn directly controls when output occurs.

您控制输出的事实是有状态ParDo处理与由触发器控制的Combine.perKey之间的重要区别.这就是为什么当触发器对于您的用例而言不够丰富时,有状态ParDo通常是一个不错的选择的原因.

The fact that you control the output is an important difference between stateful ParDo processing and Combine.perKey governed by triggers. This is why stateful ParDo is often a good choice when triggers are not rich enough for your use case.

我在Beam博客上的帖子中更详细地比较了有状态的ParDo处理与Combine +触发器:

I compare stateful ParDo processing with Combine + triggers in some more detail in my post on the Beam blog: https://beam.apache.org/blog/2017/02/13/stateful-processing.html

现在,如果有状态的ParDo上游某处有GroupByKeyCombine.perKey,则输入元素将与上游的某些触发器触发相关联.但这不会影响有状态ParDo的状态的管理方式.由于状态在各个元素之间持久存在,而窗格"只是一个元素,因此状态将一直保持到窗口完全失效为止.

Now, if there is a GroupByKey or Combine.perKey somewhere upstream from your stateful ParDo, then input elements will be associated with some trigger firing from upstream. But this does not affect how the state for your stateful ParDo is managed. As state is persisted across elements, and a "pane" is just an element, state is maintained until the window expires fully.

顺便说一句,非常好的总结引出了您的问题!

Very nice summary leading up to your question, by the way!

这篇关于Beam中的状态处理-跨窗格共享状态吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆