Exactly-once semantics in Dataflow stateful processing


Problem description

We are trying to cover the following scenario in a streaming setting:

  • Compute the total number of user events since the start of the job (e.g. a count)
  • The number of user events is unbounded (so local state alone cannot be used)

I'll discuss three options we are considering, where the two first options are prone to dataloss and the final one is unclear. We'd like to get more insight into this final one. Alternative approaches are of course welcome too.

Thanks!

Approach 1:

  1. Sliding window of x seconds
  2. Group by user id
  3. Update datastore

Update datastore would mean:

  1. Start trx
  2. datastore read for this user
  3. Merging in new info
  4. datastore write
  5. End trx

The datastore entry contains an idempotency id that equals the sliding window timestamp
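For concreteness, here is a minimal sketch of steps 1-5 as a Beam Java DoFn applied to the per-window counts (e.g. the output of Count.perKey() on the sliding windows). UserCountStore, its Transaction methods and all class/variable names are hypothetical stand-ins for the real datastore client; only the DoFn, @Element and BoundedWindow pieces are actual Beam API.

```java
import java.io.Serializable;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.values.KV;

// Hypothetical transactional store -- a stand-in for the real datastore client.
interface UserCountStore extends Serializable {
  Transaction begin();

  interface Transaction {
    long readCount(String userId);                                   // current total for this user
    void writeCount(String userId, long total, long idempotencyId);  // overwrite with merged total
    void commit();
  }
}

// Runs on the per-window counts: input is (userId, count in this sliding window).
class MergeIntoDatastoreFn extends DoFn<KV<String, Long>, Void> {
  private final UserCountStore store;

  MergeIntoDatastoreFn(UserCountStore store) {
    this.store = store;
  }

  @ProcessElement
  public void process(@Element KV<String, Long> windowCount, BoundedWindow window) {
    long idempotencyId = window.maxTimestamp().getMillis();       // sliding-window timestamp
    UserCountStore.Transaction trx = store.begin();               // 1. start trx
    long current = trx.readCount(windowCount.getKey());           // 2. datastore read for this user
    long merged = current + windowCount.getValue();               // 3. merge in new info
    trx.writeCount(windowCount.getKey(), merged, idempotencyId);  // 4. datastore write
    trx.commit();                                                 // 5. end trx
  }
}
```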

Problem:

Windows can be fired concurrently, and then can hence be processed out of order leading to dataloss (confirmed by Google)

Approach 2:

  1. Sliding window of x seconds
  2. Group by user id
  3. Update datastore

Update datastore would mean:

  1. Pre-check: check if state for this key-window is true, if so we skip the following steps
  2. Start trx
  3. datastore read for this user
  4. Merging in new info
  5. datastore write
  6. End trx
  7. Store in state for this key-window that we processed it (true)

Re-execution will hence skip duplicate updates
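As a sketch, the same merge guarded by a "processed" flag kept in Beam user state, which is scoped per key and window (the "key-window" above). It reuses the hypothetical UserCountStore from the earlier sketch; only the state API calls are real Beam API.

```java
import org.apache.beam.sdk.coders.BooleanCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.values.KV;

class GuardedMergeFn extends DoFn<KV<String, Long>, Void> {
  // User state in a stateful DoFn is scoped per key *and* window ("key-window").
  @StateId("processed")
  private final StateSpec<ValueState<Boolean>> processedSpec = StateSpecs.value(BooleanCoder.of());

  private final UserCountStore store;  // same hypothetical client as in the earlier sketch

  GuardedMergeFn(UserCountStore store) {
    this.store = store;
  }

  @ProcessElement
  public void process(@Element KV<String, Long> windowCount,
                      BoundedWindow window,
                      @StateId("processed") ValueState<Boolean> processed) {
    if (Boolean.TRUE.equals(processed.read())) {                  // 1. pre-check
      return;                                                     //    already merged, skip
    }
    UserCountStore.Transaction trx = store.begin();               // 2. start trx
    long current = trx.readCount(windowCount.getKey());           // 3. datastore read
    long merged = current + windowCount.getValue();               // 4. merge in new info
    trx.writeCount(windowCount.getKey(), merged,
                   window.maxTimestamp().getMillis());            // 5. datastore write
    trx.commit();                                                 // 6. end trx
    processed.write(true);                                        // 7. mark key-window as processed
    // A crash between commit() and processed.write(true) is exactly the gap described
    // below: the bundle is retried, the pre-check sees no state, and the merge runs twice.
  }
}
```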

Problem:

Failure between 5 and 7 will not write to local state, causing re-execution and potentially counting elements twice. We can circumvent this by using multiple states, but then we could still drop data.

Approach 3: Based on the article Timely (and Stateful) Processing with Apache Beam, we would create:

  1. A global window
  2. Group by user id
  3. Buffer/count all incoming events in a stateful DoFn
  4. Flush x time after the first event

Flushing would be the same as in approach 1
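A sketch of that stateful DoFn, loosely following the pattern in the linked blog post. The 60-second flush delay, the state/timer names and the assumption that a downstream step does the datastore merge are illustrative choices, not something the question prescribes; the state and timer calls themselves are real Beam API.

```java
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

// Input: (userId, 1L) per event, in the global window. Output: (userId, buffered count),
// which a downstream step merges into the datastore exactly as in approach 1.
class BufferAndFlushFn extends DoFn<KV<String, Long>, KV<String, Long>> {
  private static final Duration FLUSH_DELAY = Duration.standardSeconds(60);  // "x" in the text

  @StateId("key")
  private final StateSpec<ValueState<String>> keySpec = StateSpecs.value(StringUtf8Coder.of());

  @StateId("count")
  private final StateSpec<ValueState<Long>> countSpec = StateSpecs.value(VarLongCoder.of());

  @TimerId("flush")
  private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

  @ProcessElement
  public void process(@Element KV<String, Long> element,
                      @StateId("key") ValueState<String> key,
                      @StateId("count") ValueState<Long> count,
                      @TimerId("flush") Timer flush) {
    Long current = count.read();
    if (current == null) {
      // No count buffered yet for this key: schedule a flush x after this first event.
      flush.offset(FLUSH_DELAY).setRelative();
      current = 0L;
    }
    key.write(element.getKey());
    count.write(current + element.getValue());
  }

  @OnTimer("flush")
  public void onFlush(OnTimerContext ctx,
                      @StateId("key") ValueState<String> key,
                      @StateId("count") ValueState<Long> count) {
    ctx.output(KV.of(key.read(), count.read()));  // flush the buffered count downstream
    count.clear();
    key.clear();
  }
}
```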

Problem:

The guarantees for exactly-once processing and state are unclear. What would happen if an element was written in the state and a bundle would be re-executed? Is state restored to before that bundle?

Any links to documentation in this regard would be very much appreciated. E.g. how does fault-tolerance work with timers?

Answer

From your approaches 1 and 2 it is unclear whether the concern is out-of-order merging or loss of data. I can think of the following.

Approach 1: Don't merge the session window aggregates immediately, because of the out-of-order problem. Instead, store them separately, and after a sufficient amount of time you can merge the intermediate results in timestamp order.
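One way to read this suggestion, sketched with a hypothetical IntermediateStore: keep one entry per (user, window timestamp) instead of a single running total, and fold the entries together in timestamp order once enough time has passed. All names here are illustrative.

```java
import java.io.Serializable;
import java.util.SortedMap;

// Hypothetical store keyed by (userId, window timestamp), for illustration only.
interface IntermediateStore extends Serializable {
  void put(String userId, long windowTimestamp, long count);  // one entry per window, no merging
  SortedMap<Long, Long> readAll(String userId);               // window timestamp -> count, sorted

  // After "sufficient time" has passed, fold the per-window entries in timestamp order.
  static long mergeInTimestampOrder(IntermediateStore store, String userId) {
    long total = 0;
    for (long count : store.readAll(userId).values()) {  // SortedMap iterates in key order
      total += count;
    }
    return total;
  }
}
```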

Approach 2: Move the state into the transaction. This way, any temporary failure will not let the transaction complete and merge the data. Subsequent successful processing of the session window aggregates will not result in double counting.
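A sketch of what moving the state into the transaction could look like. MarkingCountStore and all of its methods are hypothetical; the point is only that the merged count and the "already applied" marker commit or fail together, so a retried bundle becomes a no-op.

```java
import java.io.Serializable;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.values.KV;

// Hypothetical transactional store where the idempotency marker lives *inside*
// the transaction, alongside the merged count.
interface MarkingCountStore extends Serializable {
  Transaction begin();

  interface Transaction {
    boolean alreadyApplied(String userId, long idempotencyId);
    void markApplied(String userId, long idempotencyId);
    long readCount(String userId);
    void writeCount(String userId, long total, long idempotencyId);
    void commit();
  }
}

class TransactionalMergeFn extends DoFn<KV<String, Long>, Void> {
  private final MarkingCountStore store;

  TransactionalMergeFn(MarkingCountStore store) {
    this.store = store;
  }

  @ProcessElement
  public void process(@Element KV<String, Long> windowCount, BoundedWindow window) {
    long idempotencyId = window.maxTimestamp().getMillis();
    MarkingCountStore.Transaction trx = store.begin();
    if (trx.alreadyApplied(windowCount.getKey(), idempotencyId)) {
      trx.commit();  // this window was already merged by an earlier attempt: no-op
      return;
    }
    long merged = trx.readCount(windowCount.getKey()) + windowCount.getValue();
    trx.writeCount(windowCount.getKey(), merged, idempotencyId);
    trx.markApplied(windowCount.getKey(), idempotencyId);
    trx.commit();  // count and marker commit atomically; a failure before this line leaves both unwritten
  }
}
```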
