在 Flink 流中使用静态 DataSet 丰富 DataStream [英] Enriching DataStream using static DataSet in Flink streaming
问题描述
我正在编写一个 Flink 流程序,其中我需要使用一些静态数据集(信息库,IB)来丰富用户事件的 DataStream.
I am writing a Flink streaming program in which I need to enrich a DataStream of user events using some static data set (information base, IB).
例如假设我们有一个买家的静态数据集,我们有一个传入的事件点击流,对于每个事件,我们想要添加一个布尔标志,指示事件的执行者是否是买家.
For E.g. Let's say we have a static data set of buyers and we have an incoming clickstream of events, for each event we want to add a boolean flag indicating whether the doer of the event is a buyer or not.
实现此目的的理想方法是按用户 ID 对传入流进行分区,让数据集中的买方设置再次按用户 ID 进行分区,然后在此数据集中查找流中的每个事件.
An ideal way to achieve this would be to partition the incoming stream by user id, have the buyers set available in a DataSet partitioned again by user id and then do a look up for each event in the stream into this DataSet.
由于 Flink 不允许在流式程序中使用 DataSets,我该如何实现上述功能?
Since Flink does not allow using DataSets in a streaming program, how can I achieve the above ?
另一种选择可能是使用托管运营商状态来存储买家集,但我如何保持此状态按用户 ID 分发,以避免在单个事件查找中进行网络输入/输出?在内存状态后端的情况下,状态是由某个键保持分布,还是在所有操作员子任务中复制?
Another option could be to use Managed Operator State to store buyers set, but how can I keep this state distributed by user id so as to avoid network i/o in individual event look ups ? In case of memory state backend, does state remain distributed by some key, or is it replicated across all operator subtasks ?
在 Flink 流程序中实现上述丰富需求的正确设计模式是什么?
What is the right design pattern to achieve the above enriching requirement in a Flink streaming program ?
推荐答案
我会通过 user_id 对流进行键控,并使用 RichFlatMap 进行充实.在 RichFlatMap 的 open() 方法中,您可以为该用户加载静态买家标志并将其缓存在布尔字段中.
I would key the stream by user_id, and use a RichFlatMap to do the enrichment. In the open() method of the RichFlatMap you can load the static buyer flag for that user and keep it cached in a boolean field.
这篇关于在 Flink 流中使用静态 DataSet 丰富 DataStream的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!