Beam中的状态处理-跨窗格共享状态吗? [英] Stateful processing in Beam - is state shared across window panes?
问题描述
Apache Beam最近通过StateSpec
和@StateId
注释引入了状态单元,并在Apache Flink和Google Cloud Dataflow中提供了部分支持.
Apache Beam has recently introduced state cells, through StateSpec
and the @StateId
annotation, with partial support in Apache Flink and Google Cloud Dataflow.
我的问题是关于在窗口流上使用有状态DoFn的情况下的状态垃圾收集.通常,当窗口过期(即水印通过窗口的末端)时,跑步者将状态移除(收集垃圾).但是,请考虑以下情况:早期触发了窗格,而抛弃了触发的窗格:
My question is about state garbage collection, in the case where a stateful DoFn is used on a windowed stream. Typically, state is removed (garbage collected) by the runner when the window expires (i.e. the watermark passes the end of the window). However, consider the case where window panes are triggered early, and the fired panes are discarded:
input.apply(Window.<MyElement>into(CalendarWindows.days(1))
.triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(
AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardMinutes(10))
))
.discardingFiredPanes()
.apply(ParDo.of(new MyStatefulDofn()));
在这种情况下,是否会保留早期触发的键的状态,直到窗口过期?也就是说,同一窗口中的后续窗格是否可以访问早期窗格编写的状态?
In this case, would the state for the keys which were fired early be kept until after the window expires? i.e. would subsequent panes in the same window have access to state written by earlier panes?
推荐答案
您的触发配置不会影响对ParDo
的有状态处理的进行方式.这些元素将立即提供给您的DoFn
,而无需任何缓冲/触发,并且您的DoFn
可以直接控制何时发生输出.
Your triggering configuration does not affect how stateful processing of a ParDo
proceeds. The elements are provided immediately to your DoFn
without any buffering/triggering and your DoFn
directly controls when output occurs.
您控制输出的事实是有状态ParDo
处理与由触发器控制的Combine.perKey
之间的重要区别.这就是为什么当触发器对于您的用例而言不够丰富时,有状态ParDo
通常是一个不错的选择的原因.
The fact that you control the output is an important difference between stateful ParDo
processing and Combine.perKey
governed by triggers. This is why stateful ParDo
is often a good choice when triggers are not rich enough for your use case.
我在Beam博客上的帖子中更详细地比较了有状态的ParDo
处理与Combine
+触发器:
I compare stateful ParDo
processing with Combine
+ triggers in some more detail in my post on the Beam blog: https://beam.apache.org/blog/2017/02/13/stateful-processing.html
现在,如果有状态的ParDo
上游某处有GroupByKey
或Combine.perKey
,则输入元素将与上游的某些触发器触发相关联.但这不会影响有状态ParDo
的状态的管理方式.由于状态在各个元素之间持久存在,而窗格"只是一个元素,因此状态将一直保持到窗口完全失效为止.
Now, if there is a GroupByKey
or Combine.perKey
somewhere upstream from your stateful ParDo
, then input elements will be associated with some trigger firing from upstream. But this does not affect how the state for your stateful ParDo
is managed. As state is persisted across elements, and a "pane" is just an element, state is maintained until the window expires fully.
顺便说一句,非常好的总结引出了您的问题!
Very nice summary leading up to your question, by the way!
这篇关于Beam中的状态处理-跨窗格共享状态吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!