Apache Flink - Partitioning the stream equally as the input Kafka topic


Problem Description

I would like to implement the following scenario in Apache Flink:

Given a Kafka topic with 4 partitions, I would like to process the intra-partition data independently in Flink, applying different logic depending on the event's type.

In particular, suppose the input Kafka topic contains the events depicted in the previous images. Each event has a different structure: partition 1 has the field "a" as its key, partition 2 has the field "b" as its key, and so on. In Flink I would like to apply different business logic depending on the event, so I thought I should split the stream in some way. To achieve what's described in the picture, I thought of doing something like this using just one consumer (I don't see why I should use more):

FlinkKafkaConsumer<..> consumer = ...
DataStream<..> stream = flinkEnv.addSource(consumer);

stream.keyBy("a").map(new AEventMapper()).addSink(...);
stream.keyBy("b").map(new BEventMapper()).addSink(...);
stream.keyBy("c").map(new CEventMapper()).addSink(...);
stream.keyBy("d").map(new DEventMapper()).addSink(...);

(a) Is this correct? Also, since I'm only interested in processing events in order within the same Kafka partition, and not globally, (b) how can I process each Flink partition in parallel? I know the method setParallelism() exists, but I don't know where to apply it in this scenario.

I'm looking for an answer to the questions marked (a) and (b). Thank you in advance.

Recommended Answer

If you can build it like this, it will perform better:

Specifically, what I'm proposing is:

  1. Set the parallelism of the entire job to exactly match the number of Kafka partitions. Then each FlinkKafkaConsumer instance will read from exactly one partition.

  2. If possible, avoid using keyBy, and avoid changing the parallelism. Then the source, map, and sink will all be chained together (this is called operator chaining), and no serialization/deserialization and no networking will be needed (within Flink). Not only will this perform well, but you can also take advantage of fine-grained recovery (streaming jobs that are embarrassingly parallel can recover one failed task without interrupting the others).

  3. You can write a general-purpose EventMapper that checks what type of event is being processed, and then does whatever is appropriate. Or you can try to be clever and implement a RichMapFunction that, in its open(), figures out which partition is being handled, and loads the appropriate mapper (a sketch illustrating these three points follows after this list).
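
As a rough illustration of points 1-3, here is a minimal sketch in Java. The class name, topic name, Kafka properties, and the string-matching inside EventMapper are hypothetical placeholders; in a real job the events would be deserialized into a proper POJO and dispatched on a type field:

import java.util.Properties;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class PerPartitionJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Point 1: 4 Kafka partitions -> job parallelism 4, so each
        // FlinkKafkaConsumer subtask reads from exactly one partition.
        env.setParallelism(4);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // hypothetical broker
        props.setProperty("group.id", "event-processor");         // hypothetical group

        FlinkKafkaConsumer<String> consumer =
                new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props);

        // Point 2: no keyBy and no parallelism change between operators,
        // so source, map, and sink are chained into one task per partition
        // (operator chaining), with no serialization or network hops in Flink.
        env.addSource(consumer)
                .map(new EventMapper())
                .print(); // stand-in sink; replace with a real sink

        env.execute("per-partition event processing");
    }

    // Point 3: a general-purpose mapper that checks each event's type
    // and applies the matching business logic.
    public static class EventMapper implements MapFunction<String, String> {
        @Override
        public String map(String event) {
            // Crude type check as a placeholder for real parsing/dispatch.
            if (event.contains("\"a\"")) {
                return "A-logic applied: " + event;
            } else if (event.contains("\"b\"")) {
                return "B-logic applied: " + event;
            } else if (event.contains("\"c\"")) {
                return "C-logic applied: " + event;
            } else {
                return "D-logic applied: " + event;
            }
        }
    }
}

Because every operator runs at the same parallelism and no keys are reassigned, each of the four chains processes one partition's events in order, which addresses question (b): a single env.setParallelism(4) (or -p 4 at submission time) is enough, with no need to call setParallelism() on individual operators.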
