Kafka Partition and Throughput

Question

I have introductory experience with Kafka and I am trying to explore its details.

I am trying to understand how Kafka partitions help improve throughput. Everything I have found online explains that more partitions mean more parallel streams, which makes sense.

However, from a different point of view it does not.

Let's say I have two consumers, each consuming 10 messages per second from a given topic. No matter whether they consume from a single partition or from two different partitions, my throughput will remain the same: 20 messages per second.

I feel like I must be missing some detail of the inner workings. Can you help me by explaining how multiple Kafka partitions improve throughput for a fixed number of consumers, compared with a single Kafka partition?

Solution

https://kafka.apache.org/intro

When I started learning Kafka, I had the same question. The following explanation should help answer it:

Let's say you have a topic A with 3 partitions: X, Y and Z.

The first thing to understand is how data is distributed across partitions:

A producer can choose which partition a message goes to. So your producer can send message#1 to partition-X, message#2 to partition-Y and message#3 to partition-Z. In the same way, other producers can choose which partition their data is written to. If your producer does not choose a partition, Kafka's default partitioner will choose one for you. For more information, check the producer API.

A producer should never push message#1 to all of partition-X, partition-Y and partition-Z: each message is written to exactly one partition. Partitions are not replicas. If you need fault tolerance, create replicas of the partitions instead.
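The partition choice described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Kafka's actual partitioner: `choose_partition` is a hypothetical helper, and `zlib.crc32` is only a stand-in for the hash a real client would use. It shows the three paths: an explicit partition, a keyed message, and a keyless message handed out in turn.

```python
import itertools
import zlib

PARTITIONS = ["X", "Y", "Z"]  # the three partitions of topic A from the text

# Counter used to spread keyless messages across the partitions in turn.
_round_robin = itertools.count()


def choose_partition(key=None, partition=None):
    """Hypothetical helper: pick the single partition one message is written to."""
    if partition is not None:
        # The producer named a partition explicitly.
        return partition
    if key is not None:
        # Same key -> same partition. crc32 is a stand-in hash (assumption).
        return PARTITIONS[zlib.crc32(key.encode("utf-8")) % len(PARTITIONS)]
    # No key and no partition: rotate through the partitions.
    return PARTITIONS[next(_round_robin) % len(PARTITIONS)]


# Three keyless messages: each lands in exactly one partition, never in all three.
keyless = [choose_partition() for _ in range(3)]
```

Whatever path is taken, the function returns one partition per message, which is the point of the paragraph above: partitions split the data, they do not copy it.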

Now, a consumer subscribes to your topic. Kafka looks at how many consumers are active within the consumer group and divides the partitions among them. Within a group, each partition is assigned to only one consumer at a time. It may allocate partitions to consumers as follows:

(In the image, P0, P1, P2 and P3 are partitions. Consumer group A has consumers C1 and C2. C1 listens to P0 and P3, and C2 listens to P1 and P2. In the end, your consumer group A receives data from all partitions.)

  1. If your consumer group had 3 consumers and you add a new one, the new consumer will sit idle. Only as many consumers as there are partitions can do useful work: number of active consumers in a consumer group <= number of partitions.
  2. If your consumer group had 2 consumers and you add a new one, a rebalance is triggered. Kafka will assign one partition to your new consumer.
  3. If this is a brand new consumer group, Kafka will assign all partitions to this new consumer.
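The three cases in the list can be reproduced with a small simulation. This is only a sketch: `assign_partitions` is a hypothetical helper that deals partitions out in turn, whereas a real Kafka group uses a configurable assignment strategy and may split partitions differently. What it does preserve is the rule that each partition has exactly one owner within the group.

```python
def assign_partitions(partitions, consumers):
    """Hypothetical helper: deal partitions out to consumers one at a time."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        # Each partition goes to exactly one consumer in the group.
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = ["X", "Y", "Z"]  # topic A from the text

# Case 3: a brand new group with a single consumer owns every partition.
case_3 = assign_partitions(partitions, ["C1"])
# Case 2: two consumers plus a new one, so each ends up with one partition.
case_2 = assign_partitions(partitions, ["C1", "C2", "C3"])
# Case 1: three consumers plus a new one, so C4 is left with nothing to read.
case_1 = assign_partitions(partitions, ["C1", "C2", "C3", "C4"])
```

In `case_1` the entry for `C4` is an empty list, which is what "sitting idle" means in practice.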

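Now let's assume your consumer is single-threaded and takes about 1 second to process a message. In case #3, your throughput would then be 1 msg/second.

In case #2, it would be 3 msg/second, because each consumer listens to a different partition and processes its data in parallel.

In case #1, you won't get any benefit: the fourth consumer has no partition to read from, so throughput stays at 3 msg/second. The same applies to your example. If both consumers are in the same consumer group and the topic has a single partition, only one of them receives data, so you get 10 msg/second rather than 20.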
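The arithmetic behind these numbers fits in one line: group throughput is the per-consumer rate multiplied by the number of consumers that actually own a partition. The sketch below uses a hypothetical `group_throughput` function and assumes that the consumers are the bottleneck, that they all belong to one consumer group, and that every partition always has messages waiting.

```python
def group_throughput(partitions, consumers, msgs_per_consumer_per_sec):
    """Hypothetical model: only consumers that own a partition do any work."""
    active_consumers = min(partitions, consumers)
    return active_consumers * msgs_per_consumer_per_sec


# The three cases from the answer: 3 partitions, 1 msg/second per consumer.
case_3 = group_throughput(3, 1, 1)  # one consumer reads all partitions
case_2 = group_throughput(3, 3, 1)  # one consumer per partition
case_1 = group_throughput(3, 4, 1)  # the fourth consumer adds nothing

# The scenario from the question: 2 consumers at 10 msg/second each.
one_partition = group_throughput(1, 2, 10)
two_partitions = group_throughput(2, 2, 10)
```

Under these assumptions the model gives 1, 3 and 3 msg/second for the three cases, and 10 versus 20 msg/second for the question's one-partition and two-partition setups.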
