Combining low-latency streams with multiple meta-data streams in Flink (enrichment)


Problem description

I am evaluating Flink for a streaming analytics scenario and haven't found sufficient information on how to fulfil a kind of ETL setup we are doing in a legacy system today.

A very common scenario is that we have keyed, slow-throughput meta-data streams that we want to use to enrich high-throughput data streams, something along the lines of the sketch below.
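(Illustrative only: the source names, the (key, value) / (key, attribute) tuple layout, and the enrichment operator are assumptions, not our actual setup.)

// Hypothetical sources; in practice these would typically be Kafka consumers.
DataStream<Tuple2<String, Double>> events   = env.addSource(highThroughputSource); // data, lives for minutes
DataStream<Tuple2<String, String>> metadata = env.addSource(slowMetadataSource);   // meta-data, lives for days

// Connect the two streams on their shared key (tuple field 0) and enrich with a
// stateful two-input operator (concrete variants are sketched in the answer below).
events.connect(metadata)
      .keyBy(0, 0)
      .process(new EnrichmentFunction()); // hypothetical enrichment operator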

This raises two questions concerning Flink: How does one enrich a fast moving stream with slowly updating streams where the time windows overlap, but are not equal (Meta-data can live for days while data lives for minutes)? And how does one efficiently join multiple (up to 10) streams with Flink, say one data stream and nine different enrichment streams?

I am aware that I can fulfil my ETL scenario with non-windowed external ETL caches, for example with Redis (which is what we use today), but I wanted to see what possibilities Flink offers.

Recommended answer

Flink has several mechanisms that can be used for enrichment.

I'm going to assume that all of the streams share a common key that can be used to join the corresponding items.

The simplest approach is probably to use a RichFlatmap and load static enrichment data in its open() method (docs about rich functions). This is only suitable if the enrichment data is static, or if you are willing to restart the enrichment job whenever you want to update the enrichment data.
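As a rough, hypothetical sketch of that pattern (the tuple layout and the hard-coded lookup table are placeholders for whatever static source you would really load from in open()):

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Hypothetical types: data events are (key, value) pairs; the enrichment attribute is looked up by key.
public class StaticEnrichmentFlatMap
        extends RichFlatMapFunction<Tuple2<String, Double>, Tuple3<String, Double, String>> {

    private transient Map<String, String> enrichment;

    @Override
    public void open(Configuration parameters) {
        // Load the static enrichment data once per task instance,
        // e.g. from a file or a database; hard-coded here for illustration.
        enrichment = new HashMap<>();
        enrichment.put("sensor-1", "building-A");
        enrichment.put("sensor-2", "building-B");
    }

    @Override
    public void flatMap(Tuple2<String, Double> event, Collector<Tuple3<String, Double, String>> out) {
        out.collect(Tuple3.of(event.f0, event.f1, enrichment.getOrDefault(event.f0, "unknown")));
    }
}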

For the other approaches described below, you should store the enrichment data as managed, keyed state (see the docs about working with state in Flink). This will enable Flink to restore and resume your enrichment job in the case of failures.
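A minimal sketch of declaring such keyed state inside a rich two-input function (the state name and value type are illustrative):

// Inside a rich function running on a keyed, connected stream,
// e.g. a RichCoFlatMapFunction or a CoProcessFunction:
private transient ValueState<String> metadataState;

@Override
public void open(Configuration parameters) {
    // One value per key; Flink checkpoints this state and restores it after a failure.
    metadataState = getRuntimeContext().getState(
            new ValueStateDescriptor<>("metadata", String.class));
}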

Assuming you want to actually stream in the enrichment data, then a RichCoFlatmap is more appropriate. This is a stateful operator that can be used to merge or join two connected streams. However, with a RichCoFlatmap you have no ability to take the timing of the stream elements into account. If you are concerned about one stream getting ahead of, or behind, the other, for example, and want the enrichment to be performed in a repeatable, deterministic fashion, then using a CoProcessFunction is the right approach.
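Below is a hedged sketch of such a CoProcessFunction: it keeps the latest meta-data value per key in managed keyed state and buffers data elements that arrive before their meta-data. The tuple layout ((key, value) data events, (key, attribute) meta-data records) and all names are assumptions made for illustration:

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.util.Collector;

public class EnrichmentFunction
        extends CoProcessFunction<Tuple2<String, Double>, Tuple2<String, String>, Tuple3<String, Double, String>> {

    // Latest meta-data attribute seen for the current key (managed keyed state).
    private transient ValueState<String> metadata;
    // Data events that arrived before any meta-data for their key.
    private transient ListState<Tuple2<String, Double>> buffered;

    @Override
    public void open(Configuration parameters) {
        metadata = getRuntimeContext().getState(
                new ValueStateDescriptor<>("metadata", String.class));
        buffered = getRuntimeContext().getListState(
                new ListStateDescriptor<>("buffered",
                        TypeInformation.of(new TypeHint<Tuple2<String, Double>>() {})));
    }

    @Override
    public void processElement1(Tuple2<String, Double> event, Context ctx,
                                Collector<Tuple3<String, Double, String>> out) throws Exception {
        String attr = metadata.value();
        if (attr != null) {
            out.collect(Tuple3.of(event.f0, event.f1, attr));
        } else {
            buffered.add(event); // hold the event back until meta-data for this key arrives
        }
    }

    @Override
    public void processElement2(Tuple2<String, String> meta, Context ctx,
                                Collector<Tuple3<String, Double, String>> out) throws Exception {
        metadata.update(meta.f1);
        // Flush anything that was waiting for this key's meta-data.
        for (Tuple2<String, Double> event : buffered.get()) {
            out.collect(Tuple3.of(event.f0, event.f1, meta.f1));
        }
        buffered.clear();
    }
}

The same structure works as a RichCoFlatMapFunction (flatMap1/flatMap2 instead of processElement1/processElement2) if you do not need access to timestamps and timers.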

You can find examples of these patterns in the Apache Flink training materials.

If you have many streams (e.g., 10) to join, you can cascade a series of these two-input CoProcessFunction operators, but that does become, admittedly, rather awkward at some point. An alternative would be to use a union operator to combine all of the meta-data streams together (note that this requires that all the streams have the same type), followed by a RichCoFlatmap or CoProcessFunction that joins this unified enrichment stream with the primary stream.
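A rough sketch of that union-based variant, again with hypothetical stream names and a generic (key, attributeName, attributeValue) tuple assumed as the common meta-data type:

// All meta-data streams must share one type before union(); here a generic
// (key, attributeName, attributeValue) tuple is assumed.
DataStream<Tuple3<String, String, String>> allMetadata =
        meta1.union(meta2, meta3 /* , ... up to nine streams */);

// A single stateful two-input operator then joins the unified meta-data stream
// with the primary data stream, keyed on the shared key (tuple field 0).
events.connect(allMetadata)
      .keyBy(0, 0)
      .process(new MultiAttributeEnrichmentFunction()); // hypothetical; could keep a MapState of attributes per key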

Update:

Flink's Table and SQL APIs can also be used for stream enrichment, and Flink 1.4 expands this support by adding streaming time-windowed inner joins. See Table API joins and SQL joins. For example:

SELECT *
FROM Orders o, Shipments s
WHERE o.id = s.orderId AND
  o.ordertime BETWEEN s.shiptime - INTERVAL '4' HOUR AND s.shiptime

This example joins orders with their corresponding shipments if the shipment occurred within 4 hours of the order being placed.
