Beam 中的状态处理 - 状态是否在窗格之间共享? [英] Stateful processing in Beam - is state shared across window panes?

查看:34
本文介绍了Beam 中的状态处理 - 状态是否在窗格之间共享?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

Apache Beam 最近通过 StateSpec@StateId 注释引入了状态单元,部分支持 Apache Flink 和 Google Cloud Dataflow.

Apache Beam has recently introduced state cells, through StateSpec and the @StateId annotation, with partial support in Apache Flink and Google Cloud Dataflow.

我的问题是关于状态垃圾收集的,在有状态 DoFn 用于窗口流的情况下.通常,当窗口到期(即水印通过窗口末尾)时,运行程序会删除(垃圾收集)状态.但是,考虑一下窗口窗格被提前触发,而被触发的窗格被丢弃的情况:

My question is about state garbage collection, in the case where a stateful DoFn is used on a windowed stream. Typically, state is removed (garbage collected) by the runner when the window expires (i.e. the watermark passes the end of the window). However, consider the case where window panes are triggered early, and the fired panes are discarded:

input.apply(Window.<MyElement>into(CalendarWindows.days(1))
  .triggering(AfterWatermark.pastEndOfWindow()
  .withEarlyFirings(
    AfterProcessingTime.pastFirstElementInPane()
      .plusDelayOf(Duration.standardMinutes(10))
  ))
  .discardingFiredPanes()
  .apply(ParDo.of(new MyStatefulDofn()));

在这种情况下,早期触发的键的状态是否会保留到窗口到期后?即同一窗口中的后续窗格是否可以访问由早期窗格写入的状态?

In this case, would the state for the keys which were fired early be kept until after the window expires? i.e. would subsequent panes in the same window have access to state written by earlier panes?

推荐答案

您的触发配置不会影响 ParDo 的有状态处理的进行方式.元素会立即提供给您的 DoFn,无需任何缓冲/触发,您的 DoFn 直接控制何时发生输出.

Your triggering configuration does not affect how stateful processing of a ParDo proceeds. The elements are provided immediately to your DoFn without any buffering/triggering and your DoFn directly controls when output occurs.

您控制输出这一事实是有状态的 ParDo 处理和由触发器控制的 Combine.perKey 之间的重要区别.这就是为什么当触发器对于您的用例来说不够丰富时,有状态的 ParDo 通常是一个不错的选择.

The fact that you control the output is an important difference between stateful ParDo processing and Combine.perKey governed by triggers. This is why stateful ParDo is often a good choice when triggers are not rich enough for your use case.

我在 Beam 博客上的帖子中更详细地比较了有状态 ParDo 处理与 Combine + 触发器:https://beam.apache.org/blog/2017/02/13/stateful-processing.html

I compare stateful ParDo processing with Combine + triggers in some more detail in my post on the Beam blog: https://beam.apache.org/blog/2017/02/13/stateful-processing.html

现在,如果在有状态的 ParDo 上游某处有 GroupByKeyCombine.perKey,那么输入元素将与一些相关联触发从上游发射.但这并不影响您的有状态 ParDo 的状态是如何管理的.由于状态是跨元素持久化的,而窗格"只是一个元素,状态会一直保持到窗口完全过期.

Now, if there is a GroupByKey or Combine.perKey somewhere upstream from your stateful ParDo, then input elements will be associated with some trigger firing from upstream. But this does not affect how the state for your stateful ParDo is managed. As state is persisted across elements, and a "pane" is just an element, state is maintained until the window expires fully.

顺便提一下,非常好的总结导致了您的问题!

Very nice summary leading up to your question, by the way!

这篇关于Beam 中的状态处理 - 状态是否在窗格之间共享?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆