Cloud Dataflow failure recovery


Problem description

I would like to use Google Cloud Dataflow to create session windows, as explained in the Dataflow model paper. I would like to send my unbounded data to Pub/Sub, then read it in Cloud Dataflow in a streaming way. I want to use session windows with a large timeout (30 min to 120 min).

My questions are:

1) What happens if the Dataflow process fails?

2) Do I lose all data stored in windows that have not yet timed out?

3) What recovery mechanisms does Dataflow provide?

Example:

Let's say I have a Sessions window with a 30-minute timeout that triggers every minute of processing time with accumulation. Let's say the value is an integer and I am simply summing all values in a window. Let's say these key-value pairs are coming from Pub/Sub:

7 -> 10 (at time 0 seconds)
7 -> 20 (at time 30 seconds)
7 -> 50 (at time 65 seconds)
7 -> 60 (at time 75 seconds)

I suppose that at time 60 seconds the window would trigger and produce a 7 -> 30 pair. I also suppose that at time 120 seconds the window would trigger again and produce a 7 -> 140 pair, since it triggers with accumulation.
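The arithmetic above can be checked with a small, self-contained simulation (plain Python, not the Beam/Dataflow API; the one-minute processing-time firings and accumulating panes are modeled directly):

```python
# Simulate a per-key session window with accumulating panes.
# Events: (arrival_time_seconds, value) for key 7, from the example above.
events = [(0, 10), (30, 20), (65, 50), (75, 60)]

def pane_at(fire_time, events):
    """Accumulating pane: sum of all values received before fire_time."""
    return sum(v for t, v in events if t < fire_time)

# The processing-time trigger fires every 60 seconds.
print(pane_at(60, events))   # 10 + 20 = 30
print(pane_at(120, events))  # 10 + 20 + 50 + 60 = 140
```

This matches the expected panes: 7 -> 30 at the first firing, 7 -> 140 at the second, because each pane re-emits everything accumulated so far rather than only the delta.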

My question is: what happens if Dataflow fails at time 70? I suppose the 3 messages received before the 70th second would already have been acked to Pub/Sub, so they won't be redelivered.

When Dataflow restarts, would it somehow restore the state of the window for key 7 so that at time 120 seconds it can produce a 7 -> 140 pair, or would it just produce a 7 -> 60 pair?

A related question: if I cancel the Dataflow job and start a new one, I suppose the new one would not have the state of the previous job. Is there a way to transfer the state to the new job?

Answer

Cloud Dataflow handles failures transparently. E.g., it will only "ack" messages in Cloud Pub/Sub after they have been processed and the results durably committed. If the Dataflow process fails (I'm assuming you're referring to, say, a crash of a worker JVM, which would then be automatically restarted, rather than a complete failure of the whole job), on restart it will connect to Pub/Sub again, and all unacked messages will be redelivered and reprocessed, including grouping into windows, etc. Window state is also durably preserved across failures, so in this case it should produce 7 -> 140.
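That ack-after-durable-commit ordering can be illustrated with a toy model (all names here are illustrative, not actual Dataflow internals): window state is committed to durable storage before the message is acked, so a worker crash loses only in-flight work, which Pub/Sub then redelivers.

```python
# Toy model of ack-after-durable-commit. Hypothetical names, not Dataflow's API.
durable_store = {}   # models persistent window state; survives worker crashes
acked = set()        # models Pub/Sub's server-side record of acked messages

def deliver(msg_id, key, value):
    """Process one message: commit state durably first, then ack."""
    if msg_id in acked:      # a redelivered, already-processed message: skip
        return
    durable_store[key] = durable_store.get(key, 0) + value
    acked.add(msg_id)        # ack only AFTER the durable commit succeeded

# The three messages processed before the crash at t=70.
for msg in [("m1", 7, 10), ("m2", 7, 20), ("m3", 7, 50)]:
    deliver(*msg)

# Worker crashes at t=70 and restarts. durable_store and the acks survive,
# so the window for key 7 still holds 80 when the t=75 message arrives.
deliver("m4", 7, 60)
print(durable_store[7])      # 140
```

If the crash had happened between the commit and the ack, the message would be redelivered, which is exactly why the redelivery check (and idempotent side effects, below in the answer) matter.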

If you are interested in how this persistence is implemented, please see the MillWheel paper - it predates Dataflow, but Dataflow uses the same persistence layer in the streaming runner.

There are no user-facing recovery mechanisms in Dataflow, because the programming model isolates you from the need to handle failures and the runner takes care of all necessary recovery. The only way failures are visible is that records can be processed multiple times, i.e., if you perform any side effects in your DoFns, those side effects must be idempotent.
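The idempotency requirement can be shown concretely: if a DoFn writes to an external store, reprocessing a record must not change the final result. A sketch with a hypothetical sink (plain Python, not the Beam API):

```python
# Non-idempotent side effect: appending. A retried record duplicates the write.
append_log = []
def append_sink(record_id, value):
    append_log.append((record_id, value))

# Idempotent side effect: keyed upsert. A retried record is a no-op.
upsert_store = {}
def upsert_sink(record_id, value):
    upsert_store[record_id] = value   # same key -> same final state

for sink in (append_sink, upsert_sink):
    sink("rec-1", 42)
    sink("rec-1", 42)   # the same record is processed again after a retry

print(len(append_log))     # 2 -- the duplicate is visible downstream
print(len(upsert_store))   # 1 -- the retry left no trace
```

In practice this usually means writing with a deterministic key (e.g., derived from the record) rather than blindly appending.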

Currently, the only case where state is transferred between jobs is during the pipeline update operation.
