如何使用 Flink 对乱序事件时间流进行排序 [英] How to sort an out-of-order event time stream using Flink

查看:58
本文介绍了如何使用 Flink 对乱序事件时间流进行排序的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这个问题涵盖如何使用 Flink SQL 对乱序流进行排序,但我更愿意使用 DataStream API.一种解决方案是使用 ProcessFunction 来做到这一点,该 ProcessFunction 使用 PriorityQueue 来缓冲事件,直到水印表明它们不再超出-顺序,但是这与 RocksDB 状态后端的性能很差(问题是每次访问 PriorityQueue 都需要整个 PriorityQueue 的 ser/de).无论使用哪个状态后端,我如何有效地执行此操作?

This question covers how to sort an out-of-order stream using Flink SQL, but I would rather use the DataStream API. One solution is to do this with a ProcessFunction that uses a PriorityQueue to buffer events until the watermark indicates they are no longer out-of-order, but this performs poorly with the RocksDB state backend (the problem is that each access to the PriorityQueue will require ser/de of the entire PriorityQueue). How can I do this efficiently regardless of which state backend is in use?

推荐答案

更好的方法(或多或少是 Flink 的 SQL 和 CEP 库在内部完成的)是将乱序流缓冲到MapState,如下:

A better approach (which is more-or-less what is done internally by Flink's SQL and CEP libraries) is to buffer the out-of-order stream in MapState, as follows:

如果您要对每个键进行独立排序,请先对流进行键控.否则,对于全局排序,通过常量对流进行键控,以便您可以使用 KeyedProcessFunction 来实现排序.

If you are sorting each key independently, then first key the stream. Otherwise, for a global sort, key the stream by a constant so that you can use a KeyedProcessFunction to implement the sorting.

在该进程函数的 open 方法中,实例化一个 MapState 对象,其中键是时间戳,值是具有相同时间戳的流元素列表.

In the open method of that process function, instantiate a MapState object, where the keys are timestamps and the values are lists of stream elements all having the same timestamp.

onElement 方法中:

  • 如果一个事件迟到了,要么放弃它,要么把它发送到一个侧输出
  • 否则,将事件附加到与其时间戳对应的地图条目
  • 为此事件的时间戳注册一个事件时间计时器

onTimer 被调用时,这个时间戳的映射中的条目准备作为排序流的一部分被释放——因为当前的水印现在表明所有较早的事件应该已经被处理.向下游发送事件后,不要忘记清除地图中的条目.

When onTimer is called, then the entries in the map for this timestamp are ready to be released as part of the sorted stream -- because the current watermark now indicates that all earlier events should have already been processed. Don't forget to clear the entry in the map after sending the events downstream.

这篇关于如何使用 Flink 对乱序事件时间流进行排序的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆