Exactly-once semantics in Dataflow stateful processing


Problem description

We are trying to cover the following scenario in a streaming setting:

  • Compute the total number of user events since the start of the job (e.g. a count)
  • The number of user events is unbounded (so local state alone cannot be used)

I'll discuss three options we are considering, where the two first options are prone to dataloss and the final one is unclear. We'd like to get more insight into this final one. Alternative approaches are of course welcome too.

Thanks!

Approach 1:

  1. Sliding window of x seconds
  2. Group by user id
  3. Update datastore

Update datastore would mean:

  1. Start trx
  2. datastore read for this user
  3. Merging in new info
  4. datastore write
  5. End trx

The datastore entry contains an idempotency id that equals the sliding window timestamp
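For concreteness, here is a minimal sketch of steps 1-5 as a Beam Java DoFn applied to the per-window counts (e.g. the output of Count.perKey() on the sliding windows). UserCountStore, its Transaction methods and all class/variable names are hypothetical stand-ins for the real datastore client; only the DoFn, @Element and BoundedWindow pieces are actual Beam API.

```java
import java.io.Serializable;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.values.KV;

// Hypothetical transactional store -- a stand-in for the real datastore client.
interface UserCountStore extends Serializable {
  Transaction begin();

  interface Transaction {
    long readCount(String userId);                                   // current total for this user
    void writeCount(String userId, long total, long idempotencyId);  // overwrite with merged total
    void commit();
  }
}

// Runs on the per-window counts: input is (userId, count in this sliding window).
class MergeIntoDatastoreFn extends DoFn<KV<String, Long>, Void> {
  private final UserCountStore store;

  MergeIntoDatastoreFn(UserCountStore store) {
    this.store = store;
  }

  @ProcessElement
  public void process(@Element KV<String, Long> windowCount, BoundedWindow window) {
    long idempotencyId = window.maxTimestamp().getMillis();       // sliding-window timestamp
    UserCountStore.Transaction trx = store.begin();               // 1. start trx
    long current = trx.readCount(windowCount.getKey());           // 2. datastore read for this user
    long merged = current + windowCount.getValue();               // 3. merge in new info
    trx.writeCount(windowCount.getKey(), merged, idempotencyId);  // 4. datastore write
    trx.commit();                                                 // 5. end trx
  }
}
```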

Problem:

Windows can be fired concurrently, and then can hence be processed out of order leading to dataloss (confirmed by Google)

Approach 2:

  1. Sliding window of x seconds
  2. Group by user id
  3. Update datastore

Update datastore would mean:

  1. Pre-check: check if state for this key-window is true, if so we skip the following steps
  2. Start trx
  3. datastore read for this user
  4. Merging in new info
  5. datastore write
  6. End trx
  7. Store in state for this key-window that we processed it (true)

Re-execution will hence skip duplicate updates
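As a sketch, the same merge guarded by a "processed" flag kept in Beam user state, which is scoped per key and window (the "key-window" above). It reuses the hypothetical UserCountStore from the earlier sketch; only the state API calls are real Beam API.

```java
import org.apache.beam.sdk.coders.BooleanCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.values.KV;

class GuardedMergeFn extends DoFn<KV<String, Long>, Void> {
  // User state in a stateful DoFn is scoped per key *and* window ("key-window").
  @StateId("processed")
  private final StateSpec<ValueState<Boolean>> processedSpec = StateSpecs.value(BooleanCoder.of());

  private final UserCountStore store;  // same hypothetical client as in the earlier sketch

  GuardedMergeFn(UserCountStore store) {
    this.store = store;
  }

  @ProcessElement
  public void process(@Element KV<String, Long> windowCount,
                      BoundedWindow window,
                      @StateId("processed") ValueState<Boolean> processed) {
    if (Boolean.TRUE.equals(processed.read())) {                  // 1. pre-check
      return;                                                     //    already merged, skip
    }
    UserCountStore.Transaction trx = store.begin();               // 2. start trx
    long current = trx.readCount(windowCount.getKey());           // 3. datastore read
    long merged = current + windowCount.getValue();               // 4. merge in new info
    trx.writeCount(windowCount.getKey(), merged,
                   window.maxTimestamp().getMillis());            // 5. datastore write
    trx.commit();                                                 // 6. end trx
    processed.write(true);                                        // 7. mark key-window as processed
    // A crash between commit() and processed.write(true) is exactly the gap described
    // below: the bundle is retried, the pre-check sees no state, and the merge runs twice.
  }
}
```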

Problem:

Failure between 5 and 7 will not write to local state, causing re-execution and potentially counting elements twice. We can circumvent this by using multiple states, but then we could still drop data.

Approach 3: Based on the article Timely (and Stateful) Processing with Apache Beam, we would create:

  1. A global window
  2. Group by user id
  3. Buffer/count all incoming events in a stateful DoFn
  4. Flush x time after the first event

Flushing would be the same as in approach 1
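A sketch of that stateful DoFn, loosely following the pattern in the linked blog post. The 60-second flush delay, the state/timer names and the assumption that a downstream step does the datastore merge are illustrative choices, not something the question prescribes; the state and timer calls themselves are real Beam API.

```java
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

// Input: (userId, 1L) per event, in the global window. Output: (userId, buffered count),
// which a downstream step merges into the datastore exactly as in approach 1.
class BufferAndFlushFn extends DoFn<KV<String, Long>, KV<String, Long>> {
  private static final Duration FLUSH_DELAY = Duration.standardSeconds(60);  // "x" in the text

  @StateId("key")
  private final StateSpec<ValueState<String>> keySpec = StateSpecs.value(StringUtf8Coder.of());

  @StateId("count")
  private final StateSpec<ValueState<Long>> countSpec = StateSpecs.value(VarLongCoder.of());

  @TimerId("flush")
  private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

  @ProcessElement
  public void process(@Element KV<String, Long> element,
                      @StateId("key") ValueState<String> key,
                      @StateId("count") ValueState<Long> count,
                      @TimerId("flush") Timer flush) {
    Long current = count.read();
    if (current == null) {
      // No count buffered yet for this key: schedule a flush x after this first event.
      flush.offset(FLUSH_DELAY).setRelative();
      current = 0L;
    }
    key.write(element.getKey());
    count.write(current + element.getValue());
  }

  @OnTimer("flush")
  public void onFlush(OnTimerContext ctx,
                      @StateId("key") ValueState<String> key,
                      @StateId("count") ValueState<Long> count) {
    ctx.output(KV.of(key.read(), count.read()));  // flush the buffered count downstream
    count.clear();
    key.clear();
  }
}
```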

Problem:

The guarantees for exactly-once processing and state are unclear. What would happen if an element was written in the state and a bundle would be re-executed? Is state restored to before that bundle?

Any links to documentation in this regard would be very much appreciated. E.g. how does fault-tolerance work with timers?

Answer

From your approaches 1 and 2 it is unclear whether the concern is out-of-order merging or loss of data. I can think of the following.

Approach 1: Don't merge the session window aggregates immediately, because of the out-of-order problem. Instead, store them separately, and after a sufficient amount of time you can merge the intermediate results in timestamp order.
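One way to read this suggestion, sketched with a hypothetical IntermediateStore: keep one entry per (user, window timestamp) instead of a single running total, and fold the entries together in timestamp order once enough time has passed. All names here are illustrative.

```java
import java.io.Serializable;
import java.util.SortedMap;

// Hypothetical store keyed by (userId, window timestamp), for illustration only.
interface IntermediateStore extends Serializable {
  void put(String userId, long windowTimestamp, long count);  // one entry per window, no merging
  SortedMap<Long, Long> readAll(String userId);               // window timestamp -> count, sorted

  // After "sufficient time" has passed, fold the per-window entries in timestamp order.
  static long mergeInTimestampOrder(IntermediateStore store, String userId) {
    long total = 0;
    for (long count : store.readAll(userId).values()) {  // SortedMap iterates in key order
      total += count;
    }
    return total;
  }
}
```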

Approach 2: Move the state into the transaction. This way, any temporary failure will not let the transaction complete and merge the data. Subsequent successful processing of the session window aggregates will not result in double counting.
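A sketch of what moving the state into the transaction could look like. MarkingCountStore and all of its methods are hypothetical; the point is only that the merged count and the "already applied" marker commit or fail together, so a retried bundle becomes a no-op.

```java
import java.io.Serializable;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.values.KV;

// Hypothetical transactional store where the idempotency marker lives *inside*
// the transaction, alongside the merged count.
interface MarkingCountStore extends Serializable {
  Transaction begin();

  interface Transaction {
    boolean alreadyApplied(String userId, long idempotencyId);
    void markApplied(String userId, long idempotencyId);
    long readCount(String userId);
    void writeCount(String userId, long total, long idempotencyId);
    void commit();
  }
}

class TransactionalMergeFn extends DoFn<KV<String, Long>, Void> {
  private final MarkingCountStore store;

  TransactionalMergeFn(MarkingCountStore store) {
    this.store = store;
  }

  @ProcessElement
  public void process(@Element KV<String, Long> windowCount, BoundedWindow window) {
    long idempotencyId = window.maxTimestamp().getMillis();
    MarkingCountStore.Transaction trx = store.begin();
    if (trx.alreadyApplied(windowCount.getKey(), idempotencyId)) {
      trx.commit();  // this window was already merged by an earlier attempt: no-op
      return;
    }
    long merged = trx.readCount(windowCount.getKey()) + windowCount.getValue();
    trx.writeCount(windowCount.getKey(), merged, idempotencyId);
    trx.markApplied(windowCount.getKey(), idempotencyId);
    trx.commit();  // count and marker commit atomically; a failure before this line leaves both unwritten
  }
}
```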
