Kafka multiple partition ordering

Question

I am aware that it is not possible to order multiple partitions in Kafka and that partition ordering is only guaranteed for a single consumer within a group (for a single partition). However, with Kafka Streams 0.10 is it now possible to achieve this? If we use the timestamp feature so that each message in each partition maintains the order, at the consumer side, let's say with Kafka Streams 0.10, is this now possible? Assuming we receive all messages, could we not sort all the partitions based on the consumed timestamp and perhaps forward them on to a separate topic for consumption?

At the moment I need to maintain ordering, but this means having a single partition with a single consumer thread. I wanted to change this to multiple partitions to increase parallelism but somehow 'get them in order'.

Any ideas? Thanks.

Answer

There are two problems you are facing in such a situation:

  1. Kafka topics with multiple partitions, where Kafka does not guarantee a global ordering (ordering across the whole topic) for such multi-partition topics.
  2. The possibility of late-arriving / out-of-order messages in the topic and its partitions with respect to time and timestamps.

I am aware that it is not possible to order multiple partitions in Kafka and that partition ordering is only guaranteed for a single consumer within a group (for a single partition). However with Kafka Streams 0.10 is it now possible to achieve this?

The short answer is: No, it is still not possible to achieve global order when you are reading from Kafka topics that have multiple partitions.

Also, "partition ordering" means "partition ordering based on the offsets of the messages in a partition". The ordering guarantee is not related to the timestamps of the messages.

Lastly, ordering is only guaranteed if max.in.flight.requests.per.connection == 1:

Producer configuration settings from the Apache Kafka documentation: max.in.flight.requests.per.connection (default: 5): The maximum number of unacknowledged requests the client will send on a single connection before blocking. Note that if this setting is set to be greater than 1 and there are failed sends, there is a risk of message re-ordering due to retries (i.e., if retries are enabled).
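
For reference, here is a minimal producer sketch (Java client) with that setting applied; the broker address, topic name, and message key are assumptions chosen for illustration, not part of the original answer:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderedProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Only one unacknowledged request in flight per connection, so a retried
        // send cannot overtake a later message within the same partition.
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1");
        props.put(ProducerConfig.RETRIES_CONFIG, "3");
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Messages with the same key always land in the same partition,
            // where they are ordered by offset.
            producer.send(new ProducerRecord<>("sensor-readings", "sensor-42", "15"));
            producer.send(new ProducerRecord<>("sensor-readings", "sensor-42", "16"));
        }
    }
}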

Note that at this point we are talking about a combination of consumer behavior (which is what your original question started out with) and producer behavior in Kafka.

If we use the timestamp feature so that each message in each partition maintains the order, at the consumer side, let's say with Kafka Streams 0.10, is this now possible?

Even with the timestamp feature we still don't achieve "each message in each partition maintains the order". Why? Because of the possibility of late-arriving / out-of-order messages.

A partition is ordered by offsets, but it is not guaranteed to be ordered by timestamps. The following contents of a partition are perfectly possible in practice (timestamps are normally milliseconds since the epoch):

Partition offsets     0    1    2    3    4    5    6    7    8
Timestamps            15   16   16   17   15   18   18   19   17
                                          ^^
                                         oops, late-arriving data!
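
As a small illustration of the offset-versus-timestamp distinction, the following consumer sketch (a recent Java client API is assumed) prints each record's partition, offset, and timestamp; a non-monotonic timestamp sequence like the one above would be directly visible in its output. The broker address, group id, and topic name are assumptions:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetVsTimestampSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "timestamp-inspector");     // assumed group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("sensor-readings")); // assumed topic
            while (true) {
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(500))) {
                    // Offsets increase strictly within a partition; timestamps may not.
                    System.out.printf("partition=%d offset=%d timestamp=%d%n",
                            r.partition(), r.offset(), r.timestamp());
                }
            }
        }
    }
}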

What are late-arriving / out-of-order messages? Imagine you have sensors scattered all over the world, all of which measure their local temperature and send the latest measurement to a Kafka topic. Some sensors may have unreliable Internet connectivity, thus their measurements may arrive with a delay of minutes, hours, or even days. Eventually their delayed measurements will make it to Kafka, but they will arrive "late". Same for mobile phones in a city: Some may run out of battery/energy and need to be recharged before they can send their data, some may lose Internet connectivity because you're driving underground, etc.

Assuming we receive all messages could we not sort all the partitions based on the consumed timestamp and perhaps forward them on to a separate topic for consumption?

In theory yes, but in practice that's quite difficult. The assumption "we receive all messages" is actually challenging for a streaming system (even for a batch processing system, though presumably the problem of late-arriving data is often simply ignored here). You never know whether you actually have received "all messages" -- because of the possibility of late-arriving data. If you receive a late-arriving message, what do you want to happen? Re-process/re-sort "all" the messages again (now including the late-arriving message), or ignore the late-arriving message (thus computing incorrect results)? In a sense, any such global ordering achieved by "let's sort all of them" is either very costly or best effort.
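
If you still wanted the best-effort variant, one way to sketch it (plain Java clients; this is not code from the original answer) is to buffer consumed records in a timestamp-ordered queue, wait out a grace period for late data, and then forward them to a single-partition output topic. The topic names, grace period, and broker address below are assumptions:

import java.time.Duration;
import java.util.Collections;
import java.util.PriorityQueue;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class BestEffortReorderSketch {
    private static final long GRACE_MS = 60_000; // how long to wait for late data (assumption)

    public static void main(String[] args) {
        Properties cProps = new Properties();
        cProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        cProps.put(ConsumerConfig.GROUP_ID_CONFIG, "reorderer");               // assumed group id
        cProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        cProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Properties pProps = new Properties();
        pProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        pProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        pProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Buffer ordered by record timestamp, smallest first.
        PriorityQueue<ConsumerRecord<String, String>> buffer =
                new PriorityQueue<>((a, b) -> Long.compare(a.timestamp(), b.timestamp()));

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {
            consumer.subscribe(Collections.singletonList("sensor-readings")); // assumed multi-partition input
            while (true) {
                consumer.poll(Duration.ofMillis(500)).forEach(buffer::add);
                long now = System.currentTimeMillis();
                // Flush only records whose timestamp is older than the grace period,
                // so that moderately late data can still slot into the right place.
                while (!buffer.isEmpty() && buffer.peek().timestamp() < now - GRACE_MS) {
                    ConsumerRecord<String, String> r = buffer.poll();
                    producer.send(new ProducerRecord<>("sensor-readings-ordered", // assumed single-partition output
                            r.key(), r.value()));
                }
            }
        }
    }
}

Note that any record arriving later than the grace period would still be emitted out of order (or would have to be dropped), which is exactly the cost/best-effort trade-off described above.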
