Apache Flink - Partitioning the stream equally as the input Kafka topic


Problem Description

I would like to implement the following scenario in Apache Flink:

Given a Kafka topic having 4 partitions, I would like to process the intra-partition data independently in Flink, using different logic depending on the event's type.

In particular, suppose the input Kafka topic contains the events depicted in the previous images. Each event has a different structure: partition 1 has the field "a" as key, partition 2 has the field "b" as key, and so on. In Flink I would like to apply different business logic depending on the event, so I thought I should split the stream in some way. To achieve what's described in the picture, I thought of doing something like the following, using just one consumer (I don't see why I should use more):

FlinkKafkaConsumer<..> consumer = ...
DataStream<..> stream = flinkEnv.addSource(consumer);

stream.keyBy("a").map(new AEventMapper()).addSink(...);
stream.keyBy("b").map(new BEventMapper()).addSink(...);
stream.keyBy("c").map(new CEventMapper()).addSink(...);
stream.keyBy("d").map(new DEventMapper()).addSink(...);

(a) Is this correct? Also, if I would like to process each Flink partition in parallel, since I'm only interested in processing in order the events within the same Kafka partition, and not in considering them globally, (b) how can I do that? I know the method setParallelism() exists, but I don't know where to apply it in this scenario.

I'm looking for an answer to the questions marked (a) and (b). Thank you in advance.

Answer

If you can build it like this, it will perform better. Specifically, my recommendation is to:

1. Set the parallelism of the entire job to exactly match the number of Kafka partitions. Then each FlinkKafkaConsumer instance will read from exactly one partition (see the sketch after this list).

2. If possible, avoid using keyBy, and avoid changing the parallelism. Then the source, map, and sink will all be chained together (this is called operator chaining), and no serialization/deserialization and no networking will be needed (within Flink). Not only will this perform well, but you can also take advantage of fine-grained recovery (streaming jobs that are embarrassingly parallel can recover one failed task without interrupting the others).
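
Putting recommendations 1 and 2 together, a minimal sketch of the suggested topology might look like the following. This is not code from the original answer: the topic name, the Event type, and the EventSchema, EventMapper, and EventSink classes are hypothetical placeholders, and 4 is assumed to be the partition count.

import java.util.Properties;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// 1. Match the job parallelism to the number of Kafka partitions,
// so each FlinkKafkaConsumer subtask reads exactly one partition.
env.setParallelism(4);

Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");  // placeholder broker
props.setProperty("group.id", "flink-consumer");           // placeholder group

FlinkKafkaConsumer<Event> consumer =
        new FlinkKafkaConsumer<>("my-topic", new EventSchema(), props);

// 2. No keyBy and no parallelism change: source, map, and sink are
// chained into one task per partition, with no shuffle in between.
env.addSource(consumer)
   .map(new EventMapper())    // a general-purpose mapper, see below
   .addSink(new EventSink());

env.execute();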

You can write a general-purpose EventMapper that checks to see what type of event is being processed, and then does whatever is appropriate. Or you can try to be clever and implement a RichMapFunction that, in its open() method, figures out which partition is being handled, and loads the appropriate mapper.
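
Here is a hedged sketch of that second, "clever" variant. The assumption that subtask i consumes Kafka partition i is only plausible when the parallelism equals the partition count, and even then the exact assignment is a connector implementation detail, so treat the switch below as illustrative. Event and the per-partition mapper classes are the hypothetical names from the question.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class PartitionAwareMapper extends RichMapFunction<Event, Event> {

    private transient MapFunction<Event, Event> delegate;

    @Override
    public void open(Configuration parameters) throws Exception {
        // With parallelism == number of partitions, each subtask handles
        // a single Kafka partition; choose its business logic once here.
        // (Assumes subtask i reads partition i, which is not guaranteed.)
        switch (getRuntimeContext().getIndexOfThisSubtask()) {
            case 0:  delegate = new AEventMapper(); break;
            case 1:  delegate = new BEventMapper(); break;
            case 2:  delegate = new CEventMapper(); break;
            default: delegate = new DEventMapper(); break;
        }
    }

    @Override
    public Event map(Event event) throws Exception {
        return delegate.map(event);
    }
}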
