Kafka multiple partition ordering

Problem description

I am aware that it is not possible to order multiple partitions in Kafka and that partition ordering is only guaranteed for a single consumer within a group (for a single partition). However, is it now possible to achieve this with Kafka Streams 0.10? If we use the timestamp feature so that each message in each partition maintains the order, is this now possible on the consumer side, let's say with Kafka Streams 0.10? Assuming we receive all messages, could we not sort all the partitions based on the consumed timestamp and perhaps forward them on to a separate topic for consumption?

At the moment I need to maintain ordering, but this means having a single partition with a single consumer thread. I wanted to change this to multiple partitions to increase parallelism but somehow 'get them in order'.

Any thoughts? Thank you.

Solution

There are two problems you are facing in such a situation:

  1. A Kafka topic that has multiple partitions, and the fact that Kafka does not guarantee global ordering (across the whole topic) for such multi-partition topics.
  2. The possibility of late-arriving / out-of-order messages for the topic and its partitions, which is related to time and timestamps.

> I am aware that it is not possible to order multiple partitions in Kafka and that partition ordering is only guaranteed for a single consumer within a group (for a single partition). However with Kafka Streams 0.10 is it now possible to achieve this?

The short answer is: No, it is still not possible to achieve global order when you are reading from Kafka topics that have multiple partitions.

Also, "partition ordering" means "partition ordering based on the offsets of the messages in a partition". The ordering guarantee is not related to the timestamps of the messages.
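
To make the offset-based guarantee concrete, here is a toy in-memory model. It is not the Kafka client API: `ToyTopic` and its round-robin placement are assumptions made purely for illustration. Each partition is an append-only list and an offset is just an index into that list, so order is only defined within a partition.

```python
class ToyTopic:
    """Hypothetical in-memory stand-in for a multi-partition topic."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        self._next = 0  # round-robin cursor (an assumption, not Kafka's partitioner)

    def send(self, value):
        """Append to the next partition and return (partition, offset)."""
        p = self._next
        self._next = (self._next + 1) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1


topic = ToyTopic(num_partitions=2)
placements = [topic.send(v) for v in ["m0", "m1", "m2", "m3", "m4"]]

# Within each partition, offset order equals append order.
partition_0 = topic.partitions[0]
partition_1 = topic.partitions[1]

# A consumer that happens to drain partition 1 before partition 0 still sees
# each partition in offset order, but not the global produce order.
consumed = partition_1 + partition_0
```

In this sketch every partition is internally ordered, yet nothing relates offset 0 of one partition to offset 0 of another, which is why reading several partitions gives no global order.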

Lastly, ordering within a partition is only guaranteed if the producer is configured with max.in.flight.requests.per.connection == 1. From the producer configuration settings in the Apache Kafka documentation:

> max.in.flight.requests.per.connection (default: 5): The maximum number of unacknowledged requests the client will send on a single connection before blocking. Note that if this setting is set to be greater than 1 and there are failed sends, there is a risk of message re-ordering due to retries (i.e., if retries are enabled).

Note that at this point we are talking about a combination of consumer behavior (which is what your original question started out with) and producer behavior in Kafka.
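
The reordering risk described in the quoted documentation can be illustrated with a toy model. This is not the Kafka producer: `simulate_log`, its parameters, and the batch names are hypothetical, and the model assumes that a failed batch is retried once and then succeeds.

```python
def simulate_log(batches, max_in_flight, fail_first_attempt):
    """Toy model of one producer connection writing to one partition.

    batches: batch names in the order the application sent them.
    max_in_flight: how many unacknowledged batches may be outstanding.
    fail_first_attempt: batches whose first send fails and is retried.
    Returns the order in which batches end up in the partition log.
    """
    log = []
    pending = list(batches)
    failed_once = set()
    while pending:
        # Send up to max_in_flight batches without waiting for acknowledgements.
        in_flight = pending[:max_in_flight]
        pending = pending[max_in_flight:]
        retry = []
        for batch in in_flight:
            if batch in fail_first_attempt and batch not in failed_once:
                failed_once.add(batch)
                retry.append(batch)  # first attempt failed, re-send later
            else:
                log.append(batch)    # written to the log and acknowledged
        # Retried batches go back to the front of the send queue.
        pending = retry + pending
    return log


batches = ["b1", "b2", "b3"]
one_in_flight = simulate_log(batches, max_in_flight=1, fail_first_attempt={"b1"})
two_in_flight = simulate_log(batches, max_in_flight=2, fail_first_attempt={"b1"})
```

With one request in flight, `b2` is not sent until `b1` has been acknowledged, so the retry cannot be overtaken. With two in flight, `b2` is already written when `b1` is retried, so the log order differs from the send order.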

> If we use the timestamp feature so that each message in each partition maintains the order, at the consumer side, let's say with Kafka Streams 0.10 is this now possible?

Even with the timestamp feature we still don't achieve "each message in each partition maintains the order". Why? Because of the possibility of late-arriving / out-of-order messages.

A partition is ordered by offsets, but it is not guaranteed to be ordered by timestamps. The following partition contents are perfectly possible in practice (timestamps are normally milliseconds-since-the-epoch):

Partition offsets     0    1    2    3    4    5    6    7    8
Timestamps            15   16   16   17   15   18   18   19   17
                                          ^^
                                         oops, late-arriving data!
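
As a small sketch of how such records could be spotted, the helper below walks one partition in offset order and flags every offset whose timestamp is lower than a timestamp already seen. `find_late_offsets` is a hypothetical helper written for this example, and "lower than the highest timestamp seen so far" is only one possible definition of "late".

```python
def find_late_offsets(timestamps):
    """Return offsets whose timestamp is lower than one seen at an earlier offset."""
    late = []
    max_seen = None
    for offset, ts in enumerate(timestamps):
        if max_seen is not None and ts < max_seen:
            late.append(offset)  # arrived after a record with a newer timestamp
        else:
            max_seen = ts
    return late


# Timestamps from the partition shown above, indexed by offset.
timestamps = [15, 16, 16, 17, 15, 18, 18, 19, 17]
late_offsets = find_late_offsets(timestamps)
```

Under this definition the record at offset 4 is flagged, and so is the one at offset 8 (timestamp 17 arriving after timestamp 19), which the diagram does not mark explicitly.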

What are late-arriving / out-of-order messages? Imagine you have sensors scattered all over the world, all of which measure their local temperature and send the latest measurement to a Kafka topic. Some sensors may have unreliable Internet connectivity, so their measurements may arrive with a delay of minutes, hours, or even days. Eventually their delayed measurements will make it to Kafka, but they will arrive "late". The same goes for mobile phones in a city: some may run out of battery and need to be recharged before they can send their data, some may lose Internet connectivity because their owners are traveling underground, etc.

> Assuming we receive all messages could we not sort all the partitions based on the consumed timestamp and perhaps forward them on to a separate topic for consumption?

In theory yes, but in practice that's quite difficult. The assumption "we receive all messages" is actually challenging for a streaming system (and even for a batch processing system, though there the problem of late-arriving data is presumably often simply ignored). You never know whether you have actually received "all messages", because late-arriving data is always possible. If you receive a late-arriving message, what do you want to happen? Re-process/re-sort "all" the messages again (now including the late-arriving message), or ignore the late-arriving message (thus computing incorrect results)? In a sense, any such global ordering achieved by "let's sort all of them" is either very costly or best effort.
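
The "best effort" option can be sketched as a bounded reorder buffer. This is a plain-Python illustration, not Kafka Streams code: `reorder_best_effort` and `allowed_lateness` are hypothetical names, and the input is assumed to be the arrival-ordered stream a consumer sees after reading from one or more partitions. Records are held back until the highest timestamp seen so far has moved past them by more than `allowed_lateness`; anything arriving later than that is dropped.

```python
import heapq


def reorder_best_effort(records, allowed_lateness):
    """Re-sort a stream by timestamp using a bounded waiting period.

    records: iterable of (timestamp, value) in arrival order.
    allowed_lateness: how far behind the highest timestamp seen so far a
        record may be and still be placed in order.
    Returns (emitted, dropped): emitted is sorted by timestamp, dropped holds
    the records that arrived too late to be placed.
    """
    buffer = []  # min-heap of (timestamp, arrival index, value)
    emitted, dropped = [], []
    max_seen = None
    for arrival, (ts, value) in enumerate(records):
        if max_seen is None or ts > max_seen:
            max_seen = ts
        watermark = max_seen - allowed_lateness
        if ts < watermark:
            dropped.append((ts, value))  # too late to be placed in order
            continue
        heapq.heappush(buffer, (ts, arrival, value))
        # Records older than the watermark can no longer be overtaken.
        while buffer and buffer[0][0] < watermark:
            t, _, v = heapq.heappop(buffer)
            emitted.append((t, v))
    # End of input: flush what is still buffered, in timestamp order.
    while buffer:
        t, _, v = heapq.heappop(buffer)
        emitted.append((t, v))
    return emitted, dropped


# Arrival-ordered records, labeled by their position in the stream.
arrivals = [15, 16, 16, 17, 15, 18, 18, 19, 17]
records = [(ts, f"o{i}") for i, ts in enumerate(arrivals)]

emitted_wide, dropped_wide = reorder_best_effort(records, allowed_lateness=2)
emitted_tight, dropped_tight = reorder_best_effort(records, allowed_lateness=1)
```

The trade-off from the paragraph above shows up directly: a larger `allowed_lateness` places more late records correctly but buffers more data and delays output, while a smaller one emits sooner and silently loses the late records.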
