Exactly-once semantics in Dataflow stateful processing

Problem description

We are trying to cover the following scenario in a streaming setting:

  • Calculating an aggregate (let's say a count) of user events since the start of the job
  • The number of user events is unbounded (hence we cannot use only local state)

I'll discuss three options we are considering, where the first two options are prone to data loss and the final one is unclear. We'd like to get more insight into this final one. Alternative approaches are of course welcome too.

Thanks!

Approach 1:

  1. Sliding window of x seconds
  2. Group by user id
  3. Update datastore

Updating the datastore means:

  1. Start trx
  2. Datastore read for this user
  3. Merge in new info
  4. Datastore write
  5. End trx

The datastore entry contains an idempotency id that equals the sliding window timestamp.
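The read-merge-write transaction above, keyed by the idempotency id, can be sketched as follows. This is a minimal illustration with an in-memory dict standing in for the datastore; all names are hypothetical, and a real datastore would wrap the read/merge/write in an actual transaction.

```python
class Datastore:
    """Toy datastore: one entry per user, tagged with the last merged window."""

    def __init__(self):
        self.entries = {}  # user_id -> {"count": int, "idempotency_id": window_ts}

    def update_count(self, user_id, window_ts, delta):
        # Start trx (atomic here; a real datastore would open a transaction)
        entry = self.entries.get(user_id, {"count": 0, "idempotency_id": None})
        if entry["idempotency_id"] == window_ts:
            return entry["count"]        # this window was already merged; skip
        entry["count"] += delta          # merge in new info
        entry["idempotency_id"] = window_ts
        self.entries[user_id] = entry    # datastore write
        return entry["count"]            # end trx

ds = Datastore()
ds.update_count("alice", window_ts=100, delta=3)
ds.update_count("alice", window_ts=100, delta=3)  # re-execution: skipped
total = ds.update_count("alice", window_ts=200, delta=2)
print(total)  # 5
```

Note that a single idempotency id only protects against a re-execution of the most recently merged window; it does not help when windows are processed out of order, which is exactly the failure mode described next.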

Problem:

Windows can fire concurrently and can hence be processed out of order, leading to data loss (confirmed by Google).

Approach 2:

  1. Sliding window of x seconds
  2. Group by user id
  3. Update datastore

Updating the datastore means:

  1. Pre-check: check if state for this key-window is true; if so, skip the following steps
  2. Start trx
  3. Datastore read for this user
  4. Merge in new info
  5. Datastore write
  6. End trx
  7. Store in state for this key-window that we processed it (true)

Re-execution will hence skip duplicate updates.

Problem:

A failure between steps 5 and 7 will not write to local state, causing re-execution and potentially counting elements twice. We can circumvent this by using multiple states, but then we could still drop data.
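The double-count window between steps 5 and 7 can be made concrete with a small simulation (hypothetical names; plain dicts stand in for the datastore and the per-key-window state):

```python
processed = {}   # local state: (user_id, window_ts) -> True
datastore = {}   # user_id -> count

def process(user_id, window_ts, delta, fail_before_state_write=False):
    key = (user_id, window_ts)
    if processed.get(key):          # 1. pre-check: already done, skip
        return
    # 2-6. trx: read, merge in new info, write, end trx
    datastore[user_id] = datastore.get(user_id, 0) + delta
    if fail_before_state_write:     # crash between step 6 and step 7
        raise RuntimeError("worker failed after commit, before state write")
    processed[key] = True           # 7. record that this key-window is done

process("alice", 100, 3)
try:
    process("alice", 200, 2, fail_before_state_write=True)
except RuntimeError:
    pass
process("alice", 200, 2)  # re-execution after failure: pre-check misses,
                          # so the delta is merged again (double count)
print(datastore["alice"])  # 7, not the correct 5
```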

Approach 3:

Based on the article Timely (and Stateful) Processing with Apache Beam, we would create:

  1. A global window
  2. Group by user id
  3. Buffer/count all incoming events in a stateful DoFn
  4. Flush x time after the first event

The flush is the same as in Approach 1.

Problem:

The guarantees for exactly-once processing and state are unclear. What would happen if an element was written to state and the bundle was then re-executed? Is state restored to before that bundle?
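The buffer-and-flush pattern of Approach 3 can be modelled in pure Python. This is not the Beam state/timer API, just an illustration of the intended semantics: per-key counter state in a global window, with a timer set x seconds after the first element for that key.

```python
class BufferingFn:
    """Toy model of a stateful DoFn that counts per key and flushes on a timer."""

    def __init__(self, flush_after):
        self.flush_after = flush_after
        self.count = {}    # per-key counter state
        self.timer = {}    # per-key flush deadline
        self.flushed = []  # (key, count) pairs emitted on flush

    def process(self, key, event_ts):
        if key not in self.timer:                          # first event for key:
            self.timer[key] = event_ts + self.flush_after  # set the flush timer
        self.count[key] = self.count.get(key, 0) + 1       # buffer/count the event

    def on_timer(self, key, now):
        if key in self.timer and now >= self.timer[key]:
            self.flushed.append((key, self.count.pop(key)))  # flush the aggregate
            del self.timer[key]                              # clear state

fn = BufferingFn(flush_after=60)
for ts in (0, 10, 30):
    fn.process("alice", ts)
fn.on_timer("alice", now=60)
print(fn.flushed)  # [('alice', 3)]
```

In the real runner, the open question above is precisely whether `count` and `timer` (Beam's ValueState/Timer) roll back together with the bundle on re-execution.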

Any links to documentation in this regard would be very much appreciated, e.g. how does fault tolerance work with timers?

Answer

From your Approaches 1 and 2 it is unclear whether the concern is out-of-order merging or loss of data. I can think of the following.

Approach 1: Don't immediately merge the session window aggregates, because of the out-of-order problem. Instead, store them separately, and after a sufficient amount of time you can merge the intermediate results in timestamp order.
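The store-separately-then-merge-in-order suggestion could look like the following sketch (all names illustrative; `watermark` here just means "merge everything with a timestamp up to this point"):

```python
intermediates = {}  # (user_id, window_ts) -> count, one entry per fired window

def store(user_id, window_ts, count):
    # Each window writes its own entry, so out-of-order arrival cannot
    # overwrite or lose another window's result.
    intermediates[(user_id, window_ts)] = count

def merge(user_id, watermark):
    """Fold all windows up to `watermark` into a total, in timestamp order."""
    ready = sorted(
        (ts, c) for (uid, ts), c in intermediates.items()
        if uid == user_id and ts <= watermark
    )
    return sum(c for _, c in ready)

store("alice", 200, 2)   # windows may arrive out of order...
store("alice", 100, 3)
print(merge("alice", watermark=250))  # ...but merging is deterministic: 5
```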

Approach 2: Move the state into the transaction. This way, any temporary failure will not let the transaction complete and merge the data. Subsequent successful processing of the session window aggregates will not result in double counting.
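Moving the processed flag into the same transaction as the merge could be sketched as below. This is an in-memory simulation with an explicit snapshot/rollback; a real datastore transaction would provide this atomicity, and all names are hypothetical.

```python
datastore = {"counts": {}, "processed": set()}  # one store, one trx boundary

def transactional_update(user_id, window_ts, delta, fail_mid_trx=False):
    # Everything below is one atomic transaction in a real datastore; here we
    # snapshot-and-restore to simulate rollback on failure.
    snapshot = (dict(datastore["counts"]), set(datastore["processed"]))
    try:
        key = (user_id, window_ts)
        if key in datastore["processed"]:
            return                       # duplicate delivery: no-op
        datastore["counts"][user_id] = datastore["counts"].get(user_id, 0) + delta
        if fail_mid_trx:
            raise RuntimeError("transient failure inside trx")
        datastore["processed"].add(key)  # flag commits together with the merge
    except RuntimeError:
        datastore["counts"], datastore["processed"] = snapshot  # rollback

transactional_update("alice", 100, 3)
transactional_update("alice", 200, 2, fail_mid_trx=True)  # rolled back entirely
transactional_update("alice", 200, 2)   # retry succeeds, merged exactly once
transactional_update("alice", 200, 2)   # duplicate: skipped by the flag
print(datastore["counts"]["alice"])  # 5
```

Because the count and the flag commit or fail as a unit, there is no window (as between steps 5 and 7 above) in which the merge is durable but the flag is not.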
