跨窗口触发器/触发删除重复项 [英] Remove duplicates across window triggers/firings
问题描述
比方说,我有一个由userid键控的句子,这是一个无穷无尽的集合,并且我想要一个不断更新的用户是否烦人的值,我们可以通过将他们曾经说过的所有句子都传递给用户来计算用户是否在烦人函数是anannoying().永远.
Let's say I have an unbounded pcollection of sentences keyed by userid, and I want a constantly updated value for whether the user is annoying, we can calculate whether a user is annoying by passing all of the sentences they've ever said into the funcion isAnnoying(). Forever.
我将窗口设置为具有afterElement(1)触发器的触发器,累积FiredPanes(),执行GroupByKey,然后使用发出用户ID,isAnnoying的ParDo
I set the window to global with a trigger afterElement(1), accumulatingFiredPanes(), do GroupByKey, then have a ParDo that emits userid,isAnnoying
这永远有效,不断累积每个用户的状态,等等.除了事实证明,在大多数情况下,新句子不会改变用户是否烦人,因此在大多数情况下,窗口会触发并发出用户ID ,isAnnoying元组,这是一个冗余更新,因此io是不必要的.我该如何捕捉这些重复的更新并删除,而每次出现一个确实会改变isAnnoying值的句子时仍然获得更新?
That works forever, keeps accumulating the state for each user etc. Except it turns out the vast majority of the time a new sentence does not change whether a user isAnnoying, and so most of the times the window fires and emits a userid,isAnnoying tuple it's a redundant update and the io was unnecessary. How do I catch these duplicate updates and drop while still getting an update every time a sentence comes in that does change the isAnnoying value?
推荐答案
今天,没有办法直接表示仅当合并结果更改时才输出".
Today there is no way to directly express "output only when the combined result has changed".
您可能可以应用的一种减少数据量的方法,具体取决于您的管道:使用.discardingFiredPanes()
,然后在GroupByKey
之后跟随一个立即过滤器,该过滤器将丢弃任何零值,其中零"表示标识CombineFn
的元素.我使用的事实是,Combine
的关联性要求意味着您必须能够独立计算句子的增量讨厌"度,而无需参考历史记录.
One approach that you may be able to apply to reduce data volume, depending on your pipeline: Use .discardingFiredPanes()
and then follow the GroupByKey
with an immediate filter that drops any zero values, where "zero" means the identity element of your CombineFn
. I'm using the fact that associativity requirements of Combine
mean you must be able to independently calculate the incremental "annoying-ness" of a sentence without reference to the history.
BEAM-23 时(每个键可交叉绑定可变且-ParDo
的-window状态已实现,您将能够手动维护该状态并自行实现这种仅在结果更改时发送输出"逻辑.
When BEAM-23 (cross-bundle mutable per-key-and-window state for ParDo
) is implemented, you will be able to manually maintain the state and implement this sort of "only send output when the result changes" logic yourself.
但是,我认为这种情况可能值得模型中明确考虑.它融合了触发和
However, I think this scenario likely deserves explicit consideration in the model. It blends the concepts embodied today by triggers and the accumulation mode.
这篇关于跨窗口触发器/触发删除重复项的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!