如何使用Flink对无序的事件时间流进行排序 [英] How to sort an out-of-order event time stream using Flink

查看:1269
本文介绍了如何使用Flink对无序的事件时间流进行排序的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

此问题涵盖如何使用Flink SQL对无序流进行排序,但是我宁愿使用DataStream API. 一个解决方案是通过使用ProcessFunction来做到这一点的,该ProcessFunction使用PriorityQueue来缓冲事件,直到水印指示它们不再超出-顺序,但这在RocksDB状态后端的执行效果很差(问题是,对PriorityQueue的每次访问都需要整个PriorityQueue的ser/de).无论使用哪个状态后端,如何都能有效地做到这一点?

This question covers how to sort an out-of-order stream using Flink SQL, but I would rather use the DataStream API. One solution is to do this with a ProcessFunction that uses a PriorityQueue to buffer events until the watermark indicates they are no longer out-of-order, but this performs poorly with the RocksDB state backend (the problem is that each access to the PriorityQueue will require ser/de of the entire PriorityQueue). How can I do this efficiently regardless of which state backend is in use?

推荐答案

更好的方法(Flink的SQL和CEP库在内部完成的工作或多或少)是将乱序流缓冲到MapState,如下:

A better approach (which is more-or-less what is done internally by Flink's SQL and CEP libraries) is to buffer the out-of-order stream in MapState, as follows:

如果要分别对每个键进行排序,请先对流进行键设置.否则,对于全局排序,请使用常量对流进行键控,以便可以使用KeyedProcessFunction来实现排序.

If you are sorting each key independently, then first key the stream. Otherwise, for a global sort, key the stream by a constant so that you can use a KeyedProcessFunction to implement the sorting.

在该流程函数的open方法中,实例化一个MapState对象,其中的键是时间戳,值是都具有相同时间戳的流元素的列表.

In the open method of that process function, instantiate a MapState object, where the keys are timestamps and the values are lists of stream elements all having the same timestamp.

onElement方法中:

  • 如果事件晚了,请将其丢弃或发送到侧面输出
  • 否则,将事件附加到对应于其时间戳的地图条目
  • 为该事件的时间戳记注册一个事件时间计时器

当调用onTimer时,该时间戳记的映射中的条目已准备好作为已排序流的一部分被释放-因为当前的水印现在指示所有早先的事件都应该已经被处理.向下游发送事件后,请不要忘记清除地图中的条目.

When onTimer is called, then the entries in the map for this timestamp are ready to be released as part of the sorted stream -- because the current watermark now indicates that all earlier events should have already been processed. Don't forget to clear the entry in the map after sending the events downstream.

这篇关于如何使用Flink对无序的事件时间流进行排序的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆