为什么 Apache Flink 需要 Watermarks 来进行事件时间处理? [英] Why does Apache Flink need Watermarks for Event Time Processing?

查看:22
本文介绍了为什么 Apache Flink 需要 Watermarks 来进行事件时间处理?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

谁能正确解释事件时间戳和水印.我从文档中理解它,但不是很清楚.现实生活中的例子或外行定义会有所帮助.另外,如果可以的话,请举一个例子(连同一些可以解释它的代码片段).提前致谢

Can someone explain Event timestamp and watermark properly. I understood it from docs, but it is not so clear. A real life example or layman definition will help. Also, if it is possible give an example ( Along with some code snippet which can explain it ).Thanks in advance

推荐答案

以下示例说明了我们为什么需要水印,以及它们的工作原理.

Here's an example that illustrates why we need watermarks, and how they work.

在这个例子中,我们有一个带时间戳的事件流,这些事件的到达顺序有点乱,如下所示.显示的数字是事件时间时间戳,指示这些事件实际发生的时间.第一个到达的事件发生在时间 4,然后是更早发生的事件,时间 2,依此类推:

In this example we have a stream of timestamped events that arrive somewhat out of order, as shown below. The numbers shown are event-time timestamps that indicate when these events actually occurred. The first event to arrive happened at time 4, and it is followed by an event that happened earlier, at time 2, and so on:

··· 23 19 22 24 21 14 17 13 12 15 9 11 7 2 4 →

现在假设我们正在尝试创建一个流排序器.这是一个应用程序,它处理来自流的每个事件到达时,并发出一个包含相同事件的新流,但按其时间戳排序.

Now imagine that we are trying create a stream sorter. This is meant to be an application that processes each event from a stream as it arrives, and emits a new stream containing the same events, but ordered by their timestamps.

一些观察:

(1) 我们的流排序器看到的第一个元素是 4,但我们不能立即将它作为排序流的第一个元素释放.它可能已经无序到达,并且可能还有更早的事件到达.事实上,我们对这个流的未来有了一些上帝般的了解,我们可以看到我们的流排序器应该至少等到 2 到达才会产生任何结果.

(1) The first element our stream sorter sees is the 4, but we can't just immediately release it as the first element of the sorted stream. It may have arrived out of order, and an earlier event might yet arrive. In fact, we have the benefit of some god-like knowledge of this stream's future, and we can see that our stream sorter should wait at least until the 2 arrives before producing any results.

结论:一些缓冲和一些延迟是必要的.

(2) 如果我们做错了,我们可能会永远等待.首先,我们的应用程序看到了时间 4 的事件,然后是时间 2 的事件.时间戳小于 2 的事件会到达吗?也许.也许不吧.我们可以永远等待,永远看不到 1.

(2) If we do this wrong, we could end up waiting forever. First our application saw an event from time 4, and then an event from time 2. Will an event with a timestamp less than 2 ever arrive? Maybe. Maybe not. We could wait forever and never see a 1.

结论:最终我们必须勇敢地发出 2 作为排序流的开始.

(3) 然后我们需要某种策略来定义对于任何给定的带时间戳的事件,何时停止等待较早事件的到来.

(3) What we need then is some sort of policy that defines when, for any given timestamped event, to stop waiting for the arrival of earlier events.

这正是水印的作用——它们定义何时停止等待较早的事件.

This is precisely what watermarks do — they define when to stop waiting for earlier events.

Flink 中的事件时间处理依赖于水印生成器,它们将特殊的时间戳元素插入到流中,称为水印.

Event-time processing in Flink depends on watermark generators that insert special timestamped elements into the stream, called watermarks.

我们的流排序器何时应该停止等待,并推出 2 以启动排序流?当水印以 2 或更大的时间戳到达时.

When should our stream sorter stop waiting, and push out the 2 to start the sorted stream? When a watermark arrives with a timestamp of 2, or greater.

(4) 我们可以想象不同的策略来决定如何生成水印.

(4) We can imagine different policies for deciding how to generate watermarks.

我们知道每个事件都会在一些延迟后到达,并且这些延迟各不相同,因此有些事件比其他事件延迟得更多.一种简单的方法是假设这些延迟受到某个最大延迟的限制.Flink 将此策略称为 bounded-out-of-orderness 水印.很容易想象更复杂的水印方法,但对于许多应用程序来说,固定延迟就足够了.

We know that each event arrives after some delay, and that these delays vary, so some events are delayed more than others. One simple approach is to assume that these delays are bounded by some maximum delay. Flink refers to this strategy as bounded-out-of-orderness watermarking. It's easy to imagine more complex approaches to watermarking, but for many applications a fixed delay works well enough.

如果你想构建一个像流排序器这样的应用程序,Flink 的 ProcessFunction 是正确的构建块.它提供对事件时间计时器(即根据水印到达触发的回调)的访问,并具有用于管理缓冲事件所需状态的钩子,直到轮到它们被发送到下游为止.

If you want to build an application like a stream sorter, Flink's ProcessFunction is the right building block. It provides access to event-time timers (that is, callbacks that fire based on the arrival of watermarks), and has hooks for managing the state needed for buffering events until it's their turn to be sent downstream.

这篇关于为什么 Apache Flink 需要 Watermarks 来进行事件时间处理?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆