In Apache Kafka why can't there be more consumer instances than partitions?

Question

I'm learning about Kafka, reading the introduction section here

https://kafka.apache.org/documentation.html#introduction

specifically the portion about Consumers. In the second to last paragraph in the Introduction it reads

Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances than partitions.

My confusion stems from that last sentence, because in the image right above that paragraph where the author depicts two consumer groups and a 4-partition topic, there are more consumer instances than partitions!

It also doesn't make sense that there can't be more consumer instances than partitions, because then partitions would be incredibly small and it seems like the overhead in creating a new partition for each consumer instance would bog down Kafka. I understand that partitions are used for fault-tolerance and reducing the load on any one server, but the sentence above does not make sense in the context of a distributed system that's supposed to be able to handle thousands of consumers at a time.

Answer

Ok, to understand it, one needs to understand several parts.

  1. To guarantee a total order, a message can be sent to only one consumer. Otherwise delivery would be extremely inefficient, because the server would need to wait for all consumers to receive the message before sending the next one:

However, although the server hands out messages in order, the messages are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the messages is lost in the presence of parallel consumption. Messaging systems often work around this by having a notion of "exclusive consumer" that allows only one process to consume from a queue, but of course this means that there is no parallelism in processing.

Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances than partitions.

Kafka only provides a total order over messages within a partition, not between different partitions in a topic.

Also, what you see as a performance penalty (multiple partitions) is actually a performance gain, because Kafka can work on different partitions completely in parallel instead of stalling while another partition finishes.
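To make the one-owner-per-partition rule concrete, here is a minimal sketch in plain Python. It is a toy model, not Kafka's actual assignment logic: the function name `assign_partitions` and the round-robin strategy are assumptions chosen for illustration. It hands each partition to exactly one consumer of a single group and shows what is left for consumers beyond the partition count.

```python
def assign_partitions(partitions, consumers):
    """Toy model: give each partition exactly one owner within ONE group.

    Round-robin is an assumption made for this sketch; it is not
    claimed to be the strategy Kafka itself uses.
    """
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2, 3]

# Fewer consumers than partitions: every consumer owns several partitions.
print(assign_partitions(partitions, ["c1", "c2"]))
# {'c1': [0, 2], 'c2': [1, 3]}

# More consumers than partitions: the surplus consumers own nothing.
print(assign_partitions(partitions, ["c1", "c2", "c3", "c4", "c5", "c6"]))
# {'c1': [0], 'c2': [1], 'c3': [2], 'c4': [3], 'c5': [], 'c6': []}
```

In this model a fifth or sixth consumer in the same group has no partition left to read, which is one way to read the sentence "there cannot be more consumer instances than partitions": extra instances in a group add no parallelism.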

  2. The picture shows different consumer groups, but the limit of at most one consumer per partition applies only within a group. You can still have multiple consumer groups.

Earlier in the introduction, the two scenarios are described:

If all the consumer instances have the same consumer group, then this works just like a traditional queue balancing load over the consumers.

If all the consumer instances have different consumer groups, then this works like publish-subscribe and all messages are broadcast to all consumers.

So the more consumer groups you have, the more work Kafka has to do, because every message must be delivered to each of those groups, and the order within each partition must be preserved for each of them.

On the other hand, the fewer groups and the more partitions you have, the more you gain from parallelizing the message processing.
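The two scenarios (one shared group versus one group per consumer) can be sketched the same way. This is a simplified simulation in plain Python, not the Kafka client API; `deliver`, the group names, and the consumer names are all hypothetical, and consumer names are assumed to be unique across groups.

```python
from collections import defaultdict


def deliver(messages, groups):
    """Simulate delivery of (partition, payload) messages to consumer groups.

    Every group sees every message, but inside a group each partition
    is read by exactly one consumer (round-robin owner, an assumption).
    Returns {consumer_name: [payloads in arrival order]}.
    """
    partitions = sorted({partition for partition, _ in messages})
    received = defaultdict(list)
    for consumers in groups.values():
        owner = {p: consumers[i % len(consumers)] for i, p in enumerate(partitions)}
        for partition, payload in messages:
            received[owner[partition]].append(payload)
    return dict(received)


messages = [(0, "a0"), (1, "b0"), (0, "a1"), (1, "b1")]

# Same group: the load is split, like a traditional queue.
print(deliver(messages, {"g": ["c1", "c2"]}))
# {'c1': ['a0', 'a1'], 'c2': ['b0', 'b1']}

# Different groups: every consumer gets everything, like publish-subscribe.
print(deliver(messages, {"g1": ["c1"], "g2": ["c2"]}))
# {'c1': ['a0', 'b0', 'a1', 'b1'], 'c2': ['a0', 'b0', 'a1', 'b1']}
```

In both runs each consumer sees the messages of any single partition in their original order (`a0` before `a1`, `b0` before `b1`), which matches the per-partition ordering described above.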
