Exactly-once semantics in Dataflow stateful processing

Problem description

We are trying to cover the following scenario in a streaming setting:

  • Calculating an aggregate (let's say a count) of user events since the start of the job
  • The number of user events is unbounded (hence we cannot use only local state)

I'll discuss three options we are considering, where the first two options are prone to data loss and the final one is unclear. We'd like to get more insight into this final one. Alternative approaches are of course welcome too.

Thanks!

Approach 1:

  1. Sliding window of x seconds
  2. Group by user id
  3. Update datastore

Updating the datastore means:

  1. Start trx
  2. Datastore read for this user
  3. Merge in new info
  4. Datastore write
  5. End trx

The datastore entry contains an idempotency id that equals the sliding window timestamp.
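The read-merge-write transaction above, keyed by the idempotency id, can be sketched as follows. This is a minimal illustration with an in-memory dict standing in for the datastore; all names are hypothetical, and a real datastore would wrap the read/merge/write in an actual transaction.

```python
class Datastore:
    """Toy datastore: one entry per user, tagged with the last merged window."""

    def __init__(self):
        self.entries = {}  # user_id -> {"count": int, "idempotency_id": window_ts}

    def update_count(self, user_id, window_ts, delta):
        # Start trx (atomic here; a real datastore would open a transaction)
        entry = self.entries.get(user_id, {"count": 0, "idempotency_id": None})
        if entry["idempotency_id"] == window_ts:
            return entry["count"]        # this window was already merged; skip
        entry["count"] += delta          # merge in new info
        entry["idempotency_id"] = window_ts
        self.entries[user_id] = entry    # datastore write
        return entry["count"]            # end trx

ds = Datastore()
ds.update_count("alice", window_ts=100, delta=3)
ds.update_count("alice", window_ts=100, delta=3)  # re-execution: skipped
total = ds.update_count("alice", window_ts=200, delta=2)
print(total)  # 5
```

Note that a single idempotency id only protects against a re-execution of the most recently merged window; it does not help when windows are processed out of order, which is exactly the failure mode described next.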

Problem:

Windows can fire concurrently and can hence be processed out of order, leading to data loss (confirmed by Google).

Approach 2:

  1. Sliding window of x seconds
  2. Group by user id
  3. Update datastore

Updating the datastore means:

  1. Pre-check: check if state for this key-window is true; if so, skip the following steps
  2. Start trx
  3. Datastore read for this user
  4. Merge in new info
  5. Datastore write
  6. End trx
  7. Store in state for this key-window that we processed it (true)

Re-execution will hence skip duplicate updates.

Problem:

A failure between steps 5 and 7 will not write to local state, causing re-execution and potentially counting elements twice. We can circumvent this by using multiple states, but then we could still drop data.
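The double-count window between steps 5 and 7 can be made concrete with a small simulation (hypothetical names; plain dicts stand in for the datastore and the per-key-window state):

```python
processed = {}   # local state: (user_id, window_ts) -> True
datastore = {}   # user_id -> count

def process(user_id, window_ts, delta, fail_before_state_write=False):
    key = (user_id, window_ts)
    if processed.get(key):          # 1. pre-check: already done, skip
        return
    # 2-6. trx: read, merge in new info, write, end trx
    datastore[user_id] = datastore.get(user_id, 0) + delta
    if fail_before_state_write:     # crash between step 6 and step 7
        raise RuntimeError("worker failed after commit, before state write")
    processed[key] = True           # 7. record that this key-window is done

process("alice", 100, 3)
try:
    process("alice", 200, 2, fail_before_state_write=True)
except RuntimeError:
    pass
process("alice", 200, 2)  # re-execution after failure: pre-check misses,
                          # so the delta is merged again (double count)
print(datastore["alice"])  # 7, not the correct 5
```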

Approach 3:

Based on the article Timely (and Stateful) Processing with Apache Beam, we would create:

  1. A global window
  2. Group by user id
  3. Buffer/count all incoming events in a stateful DoFn
  4. Flush x time after the first event

The flush is the same as in Approach 1.

Problem:

The guarantees for exactly-once processing and state are unclear. What would happen if an element was written to state and the bundle was then re-executed? Is state restored to before that bundle?
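The buffer-and-flush pattern of Approach 3 can be modelled in pure Python. This is not the Beam state/timer API, just an illustration of the intended semantics: per-key counter state in a global window, with a timer set x seconds after the first element for that key.

```python
class BufferingFn:
    """Toy model of a stateful DoFn that counts per key and flushes on a timer."""

    def __init__(self, flush_after):
        self.flush_after = flush_after
        self.count = {}    # per-key counter state
        self.timer = {}    # per-key flush deadline
        self.flushed = []  # (key, count) pairs emitted on flush

    def process(self, key, event_ts):
        if key not in self.timer:                          # first event for key:
            self.timer[key] = event_ts + self.flush_after  # set the flush timer
        self.count[key] = self.count.get(key, 0) + 1       # buffer/count the event

    def on_timer(self, key, now):
        if key in self.timer and now >= self.timer[key]:
            self.flushed.append((key, self.count.pop(key)))  # flush the aggregate
            del self.timer[key]                              # clear state

fn = BufferingFn(flush_after=60)
for ts in (0, 10, 30):
    fn.process("alice", ts)
fn.on_timer("alice", now=60)
print(fn.flushed)  # [('alice', 3)]
```

In the real runner, the open question above is precisely whether `count` and `timer` (Beam's ValueState/Timer) roll back together with the bundle on re-execution.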

Any links to documentation in this regard would be very much appreciated, e.g. how does fault tolerance work with timers?

Answer

From your Approaches 1 and 2 it is unclear whether the concern is out-of-order merging or loss of data. I can think of the following.

Approach 1: Don't immediately merge the session window aggregates, because of the out-of-order problem. Instead, store them separately, and after a sufficient amount of time you can merge the intermediate results in timestamp order.
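The store-separately-then-merge-in-order suggestion could look like the following sketch (all names illustrative; `watermark` here just means "merge everything with a timestamp up to this point"):

```python
intermediates = {}  # (user_id, window_ts) -> count, one entry per fired window

def store(user_id, window_ts, count):
    # Each window writes its own entry, so out-of-order arrival cannot
    # overwrite or lose another window's result.
    intermediates[(user_id, window_ts)] = count

def merge(user_id, watermark):
    """Fold all windows up to `watermark` into a total, in timestamp order."""
    ready = sorted(
        (ts, c) for (uid, ts), c in intermediates.items()
        if uid == user_id and ts <= watermark
    )
    return sum(c for _, c in ready)

store("alice", 200, 2)   # windows may arrive out of order...
store("alice", 100, 3)
print(merge("alice", watermark=250))  # ...but merging is deterministic: 5
```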

Approach 2: Move the state into the transaction. This way, any temporary failure will not let the transaction complete and merge the data. Subsequent successful processing of the session window aggregates will not result in double counting.
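Moving the processed flag into the same transaction as the merge could be sketched as below. This is an in-memory simulation with an explicit snapshot/rollback; a real datastore transaction would provide this atomicity, and all names are hypothetical.

```python
datastore = {"counts": {}, "processed": set()}  # one store, one trx boundary

def transactional_update(user_id, window_ts, delta, fail_mid_trx=False):
    # Everything below is one atomic transaction in a real datastore; here we
    # snapshot-and-restore to simulate rollback on failure.
    snapshot = (dict(datastore["counts"]), set(datastore["processed"]))
    try:
        key = (user_id, window_ts)
        if key in datastore["processed"]:
            return                       # duplicate delivery: no-op
        datastore["counts"][user_id] = datastore["counts"].get(user_id, 0) + delta
        if fail_mid_trx:
            raise RuntimeError("transient failure inside trx")
        datastore["processed"].add(key)  # flag commits together with the merge
    except RuntimeError:
        datastore["counts"], datastore["processed"] = snapshot  # rollback

transactional_update("alice", 100, 3)
transactional_update("alice", 200, 2, fail_mid_trx=True)  # rolled back entirely
transactional_update("alice", 200, 2)   # retry succeeds, merged exactly once
transactional_update("alice", 200, 2)   # duplicate: skipped by the flag
print(datastore["counts"]["alice"])  # 5
```

Because the count and the flag commit or fail as a unit, there is no window (as between steps 5 and 7 above) in which the merge is durable but the flag is not.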
