不管窗口时间多长,在Apache Flink中合并两个流 [英] Combine two streams in Apache Flink regardless on window time

查看:353
本文介绍了不管窗口时间多长,在Apache Flink中合并两个流的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个要合并的数据流.问题在于,一个数据流的频率比另一个数据流的频率高得多,并且有时一个数据流根本不接收事件.是否可以使用一个流中的最后一个事件,并在即将发生的每个事件中将其与另一个流一起加入?

我发现的唯一解决方案是使用join函数,但是您必须指定一个公共窗口,可以在其中应用join函数.当一个流未接收到任何事件时,这是未到达窗口.

是否有可能将连接函数应用于来自一个流或另一个流的每个事件,并保持最后消耗的事件的状态并将该事件用于连接函数?

在此先感谢您提供任何有用的提示!

解决方案

在Flink中,有两种不同的方法可以合并或合并两个流,具体取决于每个特定用例的要求.当手动"执行此操作时, 您想将Flink的ConnectedStreamRichCoFlatMapFunctionCoProcessFunction一起使用.这两种方法都可以让您保持托管状态(即不经常更新的流中的最后一个元素),并将其与更快的流结合在一起. CoProcessFunction增加了使用计时器的功能,如果相关的话,您应该使用它来清除过期键的状态.

Flink培训站点上有一个练习,介绍了实现此类联接的不同方法:扩展联接.有关简单的示例,另请参见有关到期状态的练习. >

每个最新版本的Flink都包含附加的内置连接功能,因此,此时无需自己动手.请参阅与DataStream API 加入Table API ,然后加入SQL 以获得更多详细信息.

I have two data streams that I want to combine. The problem is that one data stream has a much higher frequency than the other and there are times where one stream is not receiving events at all. Is it possible to use the last event from the one stream and join it with the other stream on every event that is coming?

The only solution I found is using the join function, but you have to specify a common window, where you can apply the join function. This is window is not reached, when one stream is not receiving any events.

Is there a possibility to apply the join function on every event that is coming from either one stream or the other and maintain state of the last consumed event and use this event for the join function?

Thanks in advance for any helpful tips!

解决方案

There are many different approaches to combining or joining two streams in Flink, depending on requirements of each specific use case. When doing this "by hand", you want to be using Flink's ConnectedStreams with a RichCoFlatMapFunction or CoProcessFunction. Either of these will allow you to keep managed state (i.e. the last element from the infrequently updating stream), and join it with the faster stream. CoProcessFunction adds the ability to work with timers, which you should use to clear state for expired keys, if that's relevant.

There's an exercise on the Flink training site about different approaches for implementing such joins: Enrichment Joins. For a simpler example, see also the exercise about Expiring State.

Each recent release of Flink has included additional built-in join functions, so at this point it is less often necessary to roll your own. See the pages on joining with the DataStream API, joins with the Table API, and joins in SQL for more details.

这篇关于不管窗口时间多长,在Apache Flink中合并两个流的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆