使用一次性一次性联接密钥进行Flink联接流 [英] Flink join streams using a disposable one-time join key

查看:83
本文介绍了使用一次性一次性联接密钥进行Flink联接流的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我对在Flink上加入两个流有疑问.我使用了有时需要的两个不同的数据流加入他们.每个数据流都已标记有唯一的ID,这些ID用作这些数据流之间的连接点.没有窗口的概念,因此为了连接这两个数据流,我要做 first.connect(second).keyBy(0,0).

I have a question around joining two streams on Flink. I use two different dataflows that at some point I need to join them. Each dataflow has been tagged with a unique id which serves as the joining point between these flows. There is no concept of window so in order to connect these two dataflows I do first.connect(second).keyBy(0,0).

当我得到正确的结果时,这似乎可行,但是我的担忧是长期的.我没有明确保留任何进行联接的运算符(coFlatMap)上的状态,但是如果假设一个流(例如,第一个)提供唯一id和第二个未能提供连接的id(我想对于那些已经加入的操作员会丢弃任何内部状态)?内存/状态占用空间是否持续增长或存在某种过期机制?

This seems to work as I get the correct results, but my worries are on long term. I do not explicitly keep any state on the operator(coFlatMap) that does the join but what happens if let's say one flow (e.g first) provides the unique id and the second fails to provide the joining id (I suppose for those already joined the operator discards any kind of internal state) ? Does the memory/state footprint grows constantly or there is some kind of expiration mechanism ?

如果是这种情况,我该如何解决这个问题?还是可以建议我另一种方法?

And if this is the case how can I tackle this problem ? or alternatively can you suggest me another approach ?

推荐答案

有几种实现此连接的方法.

There are a few approaches to implement this join.

  1. 使用

  1. Use a CoProcessFunction. When the first record for a key arrives, you store it in state and register a timer that fires x minutes/hours/days later. When the second record arrives, you perform the join and clear the state. If the second record does not arrive, the onTimer() method will be called when the timer fires. At that point, you can either just clear the state and return (INNER JOIN semantics) or forward the first record padded with a null value (OUTER JOIN semantics), clear the state, and return. The timer serves as a safety net to be able to remove the state at some point. It depends on your requirements how long you want to wait for the second record to arrive.

Table API或SQL提供了一个时间窗口连接(

The Table API or SQL provides a time-windowed join (Table API, SQL) that works similar to what I described in 1. The difference is that the windowed join implementation would try to join all records (i.e., more than one from each input stream) that arrive during the join interval and hence would keep the state longer. However, once the time is past the join interval, it would clear the state.

Flink 1.6.0(将于2018年8月初发布)将包含间隔联接,其工作方式类似于表API的窗口联接(相似的逻辑,不同的名称).与自定义实现相比,状态保持的时间也比自定义实现更长,自定义实现是基于这样的假设,即每个键在每侧仅出现一次.

Flink 1.6.0 (to be released early August 2018) will include an interval join for the DataStream API which works similar to the window join of the Table API (similar logic, different name). It would also keep the state longer than the custom implementation which is based on the assumption that each key appears just once on each side.

我会选择方法1,因为它可以提高内存效率,并且仍然相当容易实现.

I would go for approach 1. because it is more memory efficient and still reasonably easy to implement.

这篇关于使用一次性一次性联接密钥进行Flink联接流的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆