Kafka Partition and Throughput

Problem description

I have introductory experience with Kafka and I am trying to explore its details.

I am trying to understand how Kafka partitions can help improve throughput. All the information I found online explains that more partitions mean more parallel streams, which makes sense.

However, from a different point of view, it does not.

Let's say I have two consumers, each consuming 10 messages per second from a given topic. No matter whether they consume from a single partition or from two different partitions, my throughput will remain the same: 20 messages per second.

I feel like I must be missing some details of the inner workings. Can you help me by explaining how multiple Kafka partitions can improve throughput for a fixed number of consumers, compared with a single Kafka partition?

Solution

https://kafka.apache.org/intro

When I started learning Kafka, I had the same question. The following explanation should help answer it:

Let's say you have a topic A with 3 partitions: X, Y and Z.

The first thing to understand is how data is distributed across partitions:

A producer can choose which partition a message goes to. For example, your producer can send message#1 to partition-X, message#2 to partition-Y and message#3 to partition-Z. Other producers can choose partitions in the same way. If your producer does not choose a partition, Kafka will choose one for you; see the producer API for details.

A given message is written to exactly one partition: the producer should never push message#1 to partition-X, partition-Y and partition-Z at the same time. Partitions are not replicas. If you need fault tolerance, create replicas of each partition instead.
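To make the distribution rule concrete, here is a minimal pure-Python sketch of how a message might be routed to exactly one partition. The function name `choose_partition` and the use of `zlib.crc32` are assumptions for illustration only; the real Kafka client uses its own hash function and batching strategy.

```python
import zlib
from itertools import count

# Hypothetical partition names, taken from the example topic A above.
PARTITIONS = ["partition-X", "partition-Y", "partition-Z"]
_round_robin = count()

def choose_partition(key=None, explicit=None):
    """Return the single partition a message is written to (illustrative only)."""
    if explicit is not None:
        # The producer picked the partition itself.
        return explicit
    if key is not None:
        # The same key always maps to the same partition.
        return PARTITIONS[zlib.crc32(key.encode()) % len(PARTITIONS)]
    # No key and no explicit choice: take the partitions in turn.
    return PARTITIONS[next(_round_robin) % len(PARTITIONS)]

print(choose_partition(explicit="partition-Y"))
print(choose_partition(key="user-1"))
print(choose_partition())
```

Whichever branch is taken, each message ends up in one partition only, which is what distinguishes partitioning from replication.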

Now a consumer subscribes to your topic. Kafka checks how many consumers are active within the consumer group and assigns partitions to them, for example as follows:

(In the image, P0, P1, P2 and P3 are partitions. Consumer group A has consumers C1 and C2. C1 listens to P0 and P3, and C2 listens to P1 and P2. Taken together, consumer group A receives data from all partitions.)

  1. If your consumer group had 3 consumers and you add one more, the new consumer will sit idle. Within a consumer group, only as many consumers as there are partitions can be active, because each partition is assigned to at most one consumer in the group.
  2. If your consumer group had 2 consumers and you add a new one, a rebalance is triggered. Kafka assigns one partition to the new consumer, so each of the 3 consumers owns one partition.
  3. If this is a brand-new consumer group with a single consumer, Kafka assigns all partitions to that consumer.
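The three cases above can be sketched with a simple round-robin assignment over the 3 partitions of topic A. This is an assumption-laden toy, not Kafka's actual assignor (Kafka ships several assignment strategies that distribute partitions differently); the function `assign` and the consumer names are hypothetical.

```python
def assign(partitions, consumers):
    """Give each partition exactly one owner, cycling through the consumers."""
    if not consumers:
        return {}
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment

partitions = ["X", "Y", "Z"]

# Case 1: 4 consumers, 3 partitions -> the fourth consumer owns nothing.
print(assign(partitions, ["C1", "C2", "C3", "C4"]))

# Case 2: 3 consumers, 3 partitions -> one partition each.
print(assign(partitions, ["C1", "C2", "C3"]))

# Case 3: a single consumer in a new group -> it owns every partition.
print(assign(partitions, ["C1"]))
```

The property that matters is that a partition never has two owners within the same group, so any consumer beyond the partition count has nothing to read.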

Now let's assume your consumer is single-threaded and takes about 1 second to process a message. In case #3, your throughput would be 1 msg/second.

In case #2, it would be 3 msg/second, because each consumer listens to a different partition and processes its data in parallel.

In case #1, you won't get any benefit from the extra consumer: it has no partition to read from, so throughput stays at 3 msg/second.
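The arithmetic behind these numbers can be written as a small helper. It is a rough model that assumes consumers are the bottleneck, messages are spread evenly across partitions, and every consumer processes at the same rate; the function name `group_throughput` is hypothetical.

```python
def group_throughput(consumers, partitions, msgs_per_sec_per_consumer):
    """Estimate consumer-group throughput when each partition has one owner."""
    # Consumers beyond the partition count sit idle, so they add nothing.
    active_consumers = min(consumers, partitions)
    return active_consumers * msgs_per_sec_per_consumer

# The three cases above: 3 partitions, 1 msg/second per consumer.
print(group_throughput(consumers=1, partitions=3, msgs_per_sec_per_consumer=1))  # case #3
print(group_throughput(consumers=3, partitions=3, msgs_per_sec_per_consumer=1))  # case #2
print(group_throughput(consumers=4, partitions=3, msgs_per_sec_per_consumer=1))  # case #1

# The scenario from the question: 2 consumers at 10 msg/second each.
print(group_throughput(consumers=2, partitions=1, msgs_per_sec_per_consumer=10))
print(group_throughput(consumers=2, partitions=2, msgs_per_sec_per_consumer=10))
```

Under this model, the two consumers from the question only reach 20 messages per second when the topic has at least two partitions. With a single partition, one of them is idle and the group processes 10 messages per second.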
