为什么Apache Flink需要用于事件时间处理的水印? [英] Why does Apache Flink need Watermarks for Event Time Processing?

查看:231
本文介绍了为什么Apache Flink需要用于事件时间处理的水印?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

有人可以正确解释事件时间戳和水印吗?我从文档中了解了此信息,但目前尚不清楚.现实生活中的例子或外行人的定义会有所帮助.另外,如果可能的话,请举一个示例(以及一些可以解释它的代码段).预先感谢

解决方案

下面是一个示例,说明了我们为什么需要水印以及它们如何工作.

在此示例中,我们添加了带有时间戳的事件流,这些事件有时有些乱序,如下所示.显示的数字是事件时间时间戳,指示这些事件实际发生的时间.第一个到达的事件发生在时间4,然后是事件发生在更早的时间2,依此类推:

··· 23 19 22 24 21 14 17 13 12 15 9 11 7 2 4 →

现在想象一下我们正在尝试创建一个流分类器.这意味着该应用程序可以处理流中每个事件到达时的每个事件,并发出包含相同事件但按事件时间戳排序的新流.

一些观察结果:

(1)我们的流排序器看到的第一个元素是4,但是我们不能立即将其作为已排序流的第一个元素释放.它可能是乱序到达的,更早的事件可能还没有到来.实际上,我们受益于该流的未来的一些类似神的知识,并且我们可以看到我们的流分类器应该至少等到2到达后才能产生任何结果.

结论:某些缓冲和某些延迟是必要的.

(2)如果我们做错了,我们可能会永远等待.首先,我们的应用程序从时间4开始看到一个事件,然后从时间2开始看到一个事件.时间戳小于2的事件是否会到达?可能是.也许不吧.我们可以永远等待,永远不会看到1.

结论:最终,我们必须勇于创新,并发出2作为已排序流的开始.

(3)然后,我们需要一种策略,该策略定义对于任何给定时间戳的事件何时停止等待较早事件的到来.

这正是水印的作用-它们定义何时停止等待较早的事件.

Flink中的事件时间处理取决于水印生成器,该生成器将带有时间戳的特殊元素插入流中,称为水印.

我们的流分类器何时应停止等待,并推出2以启动已分类的流?水印到达时的时间戳为2或更大时.

(4)我们可以想象用于决定如何生成水印的不同策略.

我们知道每个事件都会在某些延迟后到达,并且这些延迟会有所不同,因此某些事件比其他事件延迟得更多.一种简单的方法是假定这些延迟受某个最大延迟的限制. Flink将此策略称为有界乱序水印.可以想像出更复杂的水印方法,但是对于许多应用程序来说,固定的延迟就足够了.

如果您要构建流分类器之类的应用程序,则Flink的ProcessFunction是正确的构建块.它提供对事件时间计时器(即,根据水印的到达而触发的回调)的访问,并具有用于管理缓冲事件所需的状态的钩子,直到将其发送到下游为止.

Can someone explain Event timestamp and watermark properly. I understood it from docs, but it is not so clear. A real life example or layman definition will help. Also, if it is possible give an example ( Along with some code snippet which can explain it ).Thanks in advance

解决方案

Here's an example that illustrates why we need watermarks, and how they work.

In this example we have a stream of timestamped events that arrive somewhat out of order, as shown below. The numbers shown are event-time timestamps that indicate when these events actually occurred. The first event to arrive happened at time 4, and it is followed by an event that happened earlier, at time 2, and so on:

··· 23 19 22 24 21 14 17 13 12 15 9 11 7 2 4 →

Now imagine that we are trying create a stream sorter. This is meant to be an application that processes each event from a stream as it arrives, and emits a new stream containing the same events, but ordered by their timestamps.

Some observations:

(1) The first element our stream sorter sees is the 4, but we can't just immediately release it as the first element of the sorted stream. It may have arrived out of order, and an earlier event might yet arrive. In fact, we have the benefit of some god-like knowledge of this stream's future, and we can see that our stream sorter should wait at least until the 2 arrives before producing any results.

Conclusion: Some buffering, and some delay, is necessary.

(2) If we do this wrong, we could end up waiting forever. First our application saw an event from time 4, and then an event from time 2. Will an event with a timestamp less than 2 ever arrive? Maybe. Maybe not. We could wait forever and never see a 1.

Conclusion: Eventually we have to be courageous and emit the 2 as the start of the sorted stream.

(3) What we need then is some sort of policy that defines when, for any given timestamped event, to stop waiting for the arrival of earlier events.

This is precisely what watermarks do — they define when to stop waiting for earlier events.

Event-time processing in Flink depends on watermark generators that insert special timestamped elements into the stream, called watermarks.

When should our stream sorter stop waiting, and push out the 2 to start the sorted stream? When a watermark arrives with a timestamp of 2, or greater.

(4) We can imagine different policies for deciding how to generate watermarks.

We know that each event arrives after some delay, and that these delays vary, so some events are delayed more than others. One simple approach is to assume that these delays are bounded by some maximum delay. Flink refers to this strategy as bounded-out-of-orderness watermarking. It's easy to imagine more complex approaches to watermarking, but for many applications a fixed delay works well enough.

If you want to build an application like a stream sorter, Flink's ProcessFunction is the right building block. It provides access to event-time timers (that is, callbacks that fire based on the arrival of watermarks), and has hooks for managing the state needed for buffering events until it's their turn to be sent downstream.

这篇关于为什么Apache Flink需要用于事件时间处理的水印?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆