How to preserve order of records when implementing an ETL job with Flink?


Question

Suppose I want to implement an ETL job with Flink, where both the source and the sink are Kafka topics with only one partition.
The order of records in the source and sink matters downstream (other teams maintain further jobs that consume my ETL's sink topic).
Is there any way to make sure the order of records in the sink is the same as in the source, while using a parallelism greater than 1?

Answer

https://stackoverflow.com/a/69094404/2000823 covers parts of your question. The basic principle is that two events will maintain their relative ordering so long as they take the same path through the execution graph. Otherwise, the events will race against each other, and there is no guarantee regarding ordering.

If your job only has FORWARD connections between the tasks, then the order will always be preserved. If you use keyBy or rebalance (to change the parallelism), then it will not.
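As a minimal sketch of the FORWARD-only case (assuming Flink 1.14+ with the KafkaSource/KafkaSink connector; the broker address and the topic names "etl-input"/"etl-output" are placeholders, not from the question): every operator runs at parallelism 1, so every connection stays FORWARD (and the operators are chained by default), and records leave in the order they arrived.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class OrderPreservingEtl {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Single-partition input topic: only one source subtask can read it anyway.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")          // placeholder address
                .setTopics("etl-input")                     // placeholder topic name
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("etl-output")             // placeholder topic name
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .setDeliveryGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
                .setParallelism(1)
                // map/filter keep the same parallelism, so the connections stay
                // FORWARD and the relative order of records is preserved.
                .map(String::toUpperCase).setParallelism(1)
                .filter(s -> !s.isEmpty()).setParallelism(1)
                // Inserting .keyBy(...) or .rebalance() here would repartition the
                // stream and give up the end-to-end ordering guarantee.
                .sinkTo(sink).setParallelism(1)
                .name("kafka-sink");

        env.execute("order-preserving-etl");
    }
}
```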

A Kafka topic with one partition cannot be read from (or written to) in parallel. You can increase the parallelism of the job, but this will only have a meaningful effect on intermediate tasks (since in this case the source and sink cannot operate in parallel) -- which then introduces the possibility of events ending up out-of-order.
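For illustration, this is the kind of topology that paragraph warns about (a fragment only, reusing env, source and sink from the sketch above): the intermediate map runs at parallelism 4 while the single-partition source and sink stay at parallelism 1. Flink redistributes records wherever the parallelism changes, so they race across the four map subtasks and can reach the sink in a different order than they were read.

```java
// Sketch only: parallelism 1 -> 4 -> 1. The implicit redistribution where the
// parallelism changes lets records overtake each other, so the sink may see
// them in a different order than the source emitted them.
env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
        .setParallelism(1)
        .map(String::trim)
        .setParallelism(4)   // fans out across 4 subtasks: records now race
        .sinkTo(sink)
        .setParallelism(1);  // fans back in: arrival order at the sink is arbitrary
```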

If it's enough to maintain the ordering on a key-by-key basis, then with just one partition you'll always be fine. With multiple partitions being consumed in parallel, if you use keyBy (or GROUP BY in SQL), you'll be okay only if all events for a given key are always in the same Kafka partition.
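If per-key order is all that's needed, a keyBy sketch like the one below (again reusing env, source and sink from above; the key extractor, which takes the prefix before ':' in each record, is purely hypothetical) keeps the relative order of events that share a key, because those events always travel the same source-subtask-to-keyed-subtask path, provided all events for that key sit in the same Kafka partition.

```java
// Sketch only: per-key ordering with keyBy.
env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
        // Hypothetical key: the part of the record before the first ':'.
        .keyBy(line -> line.split(":", 2)[0])
        // Events with the same key are processed by the same subtask, in the order
        // their source subtask delivered them; events with different keys may be
        // reordered relative to each other.
        .map(String::toUpperCase)
        .sinkTo(sink);
```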
