In Apache Kafka why can't there be more consumer instances than partitions?

Question

I'm learning about Kafka, reading the introduction section here

https://kafka.apache.org/documentation.html#introduction

specifically the portion about Consumers. In the second to last paragraph in the Introduction it reads


Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances than partitions.

My confusion stems from that last sentence, because in the image right above that paragraph where the author depicts two consumer groups and a 4-partition topic, there are more consumer instances than partitions!

It also doesn't make sense that there can't be more consumer instances than partitions, because then partitions would be incredibly small and it seems like the overhead in creating a new partition for each consumer instance would bog down Kafka. I understand that partitions are used for fault-tolerance and reducing the load on any one server, but the sentence above does not make sense in the context of a distributed system that's supposed to be able to handle thousands of consumers at a time.

Recommended answer

Ok, to understand it, one needs to understand several parts.


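  1. To provide a total order, a message can be sent to only one consumer. Otherwise it would be extremely inefficient, because the broker would need to wait for all consumers to receive the message before sending the next one: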



However, although the server hands out messages in order, the messages are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the messages is lost in the presence of parallel consumption. Messaging systems often work around this by having a notion of "exclusive consumer" that allows only one process to consume from a queue, but of course this means that there is no parallelism in processing.

Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances than partitions.

Kafka only provides a total order over the messages within a partition, not between different partitions in a topic.

Also, what you think is a performance penalty (multiple partitions) is actually a performance gain, as Kafka can work on different partitions completely in parallel instead of waiting for one partition to finish before handling another.
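The one-consumer-per-partition rule can be sketched with a small simulation. This is a simplified model, not Kafka's actual assignment strategy; the function name `assign_partitions` and the consumer names are hypothetical. It spreads partitions round-robin over the consumers of a single group and shows that, once the group has more consumers than partitions, the extra consumers simply get nothing to read. In practice that is what the sentence in the documentation amounts to: surplus consumers in a group sit idle.

```python
def assign_partitions(partitions, consumers):
    """Assign each partition to exactly one consumer of a single group.

    Simplified round-robin model (an assumption for illustration only).
    """
    assignment = {consumer: [] for consumer in consumers}
    if not consumers:
        return assignment
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2, 3]

# Fewer consumers than partitions: every consumer owns several partitions.
small_group = assign_partitions(partitions, ["c1", "c2"])

# More consumers than partitions: two consumers end up with no partition.
large_group = assign_partitions(partitions, ["c1", "c2", "c3", "c4", "c5", "c6"])
idle = [consumer for consumer, owned in large_group.items() if not owned]

print(small_group)  # {'c1': [0, 2], 'c2': [1, 3]}
print(idle)         # ['c5', 'c6']
```

Because every partition has a single owner within the group, the order of messages inside that partition is preserved for whoever reads it.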


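  2. The picture shows different consumer groups, but the limit of at most one consumer per partition applies only within a group. You can still have multiple consumer groups.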

In the beginning the two scenarios are described:


If all the consumer instances have the same consumer group, then this works just like a traditional queue balancing load over the consumers.

If all the consumer instances have different consumer groups, then this works like publish-subscribe and all messages are broadcast to all consumers.

So, the more consumer groups you have, the lower the performance is, as Kafka needs to deliver every message to each of those groups while still guaranteeing the order.

On the other hand, the fewer groups and the more partitions you have, the more you gain from parallelizing the message processing.
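The two scenarios combine as in the picture from the question: every group receives all messages, while inside a group each partition goes to exactly one consumer. Below is a minimal sketch of that behaviour, assuming a 4-partition topic and two groups of 2 and 4 consumers (group sizes, names, and the `deliver` helper are all hypothetical, and the modulo-based ownership is a stand-in for Kafka's real assignment logic).

```python
def deliver(messages, groups):
    """Simulate delivery of (partition, payload) messages to consumer groups.

    Every group sees every message; within a group, the partition decides
    which single consumer receives it (simplified modulo ownership).
    """
    received = {}
    for group, consumers in groups.items():
        for consumer in consumers:
            received[(group, consumer)] = []
        for partition, payload in messages:
            owner = consumers[partition % len(consumers)]
            received[(group, owner)].append(payload)
    return received


messages = [(0, "m0"), (1, "m1"), (2, "m2"), (3, "m3"), (0, "m4")]
groups = {
    "A": ["a1", "a2"],              # 2 consumers share 4 partitions
    "B": ["b1", "b2", "b3", "b4"],  # 1 consumer per partition
}

received = deliver(messages, groups)
for key, payloads in received.items():
    print(key, payloads)
```

Six consumer instances read from four partitions here, yet no group has more consumers than partitions, so the limit from the documentation is never hit.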
