What happens if the leader is not dead but unable to receive messages in Kafka? SPoF?


Question

I have 3 brokers and 3 partitions. Each broker is the leader for one partition and is in the ISR for all of them. Say the brokers run on ports 19092, 29092, and 39092 respectively.

19092 - partition 0
29092 - partition 1
39092 - partition 2

Half-broker test:

That is what I would like to call it, because the broker is allowed OUTPUT but not INPUT.

Now, I have added the following iptables rule:

iptables -A INPUT -p tcp --dport 29092 -j DROP

And in the producer:

bin/kafka-console-producer --broker-list 10.54.8.172:19092 --topic ftest

The iptables rule above blocks INPUT access to the broker's port, but it does not stop the broker from reporting its liveness to ZooKeeper. So ZooKeeper does not consider the broker dead, and no leader election takes place for partition 1.

The producer, however, cannot connect to the broker because of the rule, and so it throws an error:

org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for ftest-1: 1778 ms has passed since batch creation plus linger time

I did this manually, but there can be other reasons why INPUT access gets blocked (malware, DDoS, or anything else).

Before the iptables rule:

Metadata for ftest (from broker 1: 10.54.8.172:19092/1):
 3 brokers:
  broker 2 at 10.54.8.172:29092
  broker 1 at 10.54.8.172:19092
  broker 3 at 10.54.8.172:39092
 1 topics:
  topic "ftest" with 3 partitions:
    partition 2, leader 3, replicas: 3,1,2, isrs: 3,1,2
    partition 1, leader 2, replicas: 2,3,1, isrs: 2,3,1
    partition 0, leader 1, replicas: 1,2,3, isrs: 1,2,3

After the iptables rule:

Metadata for ftest (from broker 1: 10.54.8.172:19092/1):
 3 brokers:
  broker 2 at 10.54.8.172:29092
  broker 1 at 10.54.8.172:19092
  broker 3 at 10.54.8.172:39092
 1 topics:
  topic "ftest" with 3 partitions:
    partition 2, leader 3, replicas: 3,1,2, isrs: 3,1,2
    partition 1, leader 2, replicas: 2,3,1, isrs: 2
    partition 0, leader 1, replicas: 1,2,3, isrs: 1,2,3
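
The difference between the two listings is easy to miss by eye. As a rough sketch (the helper names `parse_partitions` and `under_replicated` are made up for this illustration and are not part of any Kafka tool), the partition lines from the "after" listing can be parsed to flag any partition whose ISR is smaller than its replica list:

```python
import re

# Partition lines copied from the "after" metadata listing.
AFTER = """\
partition 2, leader 3, replicas: 3,1,2, isrs: 3,1,2
partition 1, leader 2, replicas: 2,3,1, isrs: 2
partition 0, leader 1, replicas: 1,2,3, isrs: 1,2,3
"""

LINE = re.compile(
    r"partition (\d+), leader (\d+), replicas: ([\d,]+), isrs: ([\d,]+)"
)


def parse_partitions(text):
    """Return {partition: (leader, replicas, isrs)} parsed from metadata text."""
    result = {}
    for match in LINE.finditer(text):
        partition, leader, replicas, isrs = match.groups()
        result[int(partition)] = (
            int(leader),
            [int(b) for b in replicas.split(",")],
            [int(b) for b in isrs.split(",")],
        )
    return result


def under_replicated(partitions):
    """Return the partitions whose ISR is smaller than the replica list."""
    return sorted(
        p for p, (_, replicas, isrs) in partitions.items()
        if len(isrs) < len(replicas)
    )


partitions = parse_partitions(AFTER)
print(under_replicated(partitions))  # prints [1]
```

Only partition 1, the one led by the broker on port 29092, is flagged: its leader is unchanged, but its ISR has shrunk to the leader alone.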

Since there is only one leader and it is dead (in the sense that it cannot receive any messages), isn't this a single point of failure?

I think there should ideally be two-way communication between ZooKeeper and the Kafka brokers. Shouldn't there? Does Kafka allow it? If so, how?

Also, when 29092 is blocked for INPUT access, the ISR of its partition shrinks to 1.

That could be because it cannot receive any messages (heartbeats) from the other 2 brokers.

If it can connect to them (OUTPUT is enabled), then it can write to them, but for the replication to be acknowledged it needs INPUT access.

So both INPUT and OUTPUT are needed here as well.

Broker 29092 is as good as nothing here, which leaves the system in an unrecoverable state!

Recommended answer

Your question is probably best answered by understanding how Kafka uses ZooKeeper primitives to maintain and organize cluster state.

In Kafka, leader election is orchestrated by one of the brokers, which acts as the controller. There is only one controller, and it is elected from among the brokers using ZooKeeper.

Each broker registers itself as an "ephemeral node" in ZooKeeper. The broker that initiated the ZooKeeper session maintains its membership with periodic heartbeats (ticks, in ZooKeeper terms). If a broker fails to tick within the timeout interval, ZooKeeper removes its node, and the Kafka controller, which has registered to be notified of that event (via ZooKeeper watches), gets notified. If the failed broker was the leader of a partition, this triggers a new leader election. The controller handles the election and notifies all the brokers.
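
To make that sequence concrete, here is a toy simulation. It is only a sketch under stated assumptions: `ToyZooKeeper` and `ToyController` are hypothetical classes written for this illustration, not a real ZooKeeper client or Kafka code. Time is a plain integer, and the controller simply promotes the next live replica in the assignment order, ignoring the ISR.

```python
class ToyZooKeeper:
    """Toy model of ephemeral nodes kept alive by session heartbeats."""

    def __init__(self, session_timeout):
        self.session_timeout = session_timeout
        self.last_heartbeat = {}  # broker id -> time of last heartbeat
        self.watchers = []        # callbacks invoked when a node is removed

    def register(self, broker_id, now):
        self.last_heartbeat[broker_id] = now

    def heartbeat(self, broker_id, now):
        if broker_id in self.last_heartbeat:
            self.last_heartbeat[broker_id] = now

    def watch(self, callback):
        self.watchers.append(callback)

    def expire_sessions(self, now):
        """Remove nodes whose session timed out and notify the watchers."""
        expired = [b for b, t in self.last_heartbeat.items()
                   if now - t > self.session_timeout]
        for broker_id in expired:
            del self.last_heartbeat[broker_id]
            for callback in self.watchers:
                callback(broker_id)


class ToyController:
    """Toy controller: re-elects a leader when a broker node disappears."""

    def __init__(self, leaders, replicas):
        self.leaders = dict(leaders)  # partition -> leader broker id
        self.replicas = replicas      # partition -> replica list
        self.dead = set()

    def on_broker_removed(self, broker_id):
        self.dead.add(broker_id)
        for partition, leader in self.leaders.items():
            if leader == broker_id:
                live = [b for b in self.replicas[partition]
                        if b not in self.dead]
                self.leaders[partition] = live[0] if live else None


zk = ToyZooKeeper(session_timeout=6)
controller = ToyController(
    leaders={0: 1, 1: 2, 2: 3},
    replicas={0: [1, 2, 3], 1: [2, 3, 1], 2: [3, 1, 2]},
)
zk.watch(controller.on_broker_removed)
for broker in (1, 2, 3):
    zk.register(broker, now=0)

# Case 1: broker 2 cannot accept client traffic but still heartbeats.
for now in (2, 4, 6, 8):
    for broker in (1, 2, 3):
        zk.heartbeat(broker, now)
    zk.expire_sessions(now)
leaders_while_heartbeating = dict(controller.leaders)

# Case 2: broker 2 stops heartbeating, so its session eventually expires.
for now in (10, 12, 14, 16):
    for broker in (1, 3):
        zk.heartbeat(broker, now)
    zk.expire_sessions(now)
leaders_after_expiry = dict(controller.leaders)

print(leaders_while_heartbeating)  # {0: 1, 1: 2, 2: 3}
print(leaders_after_expiry)        # {0: 1, 1: 3, 2: 3}
```

Case 1 mirrors the iptables test: as long as the heartbeats keep arriving, the node is never removed, the watch never fires, and broker 2 stays leader of partition 1.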

So yes, there is two-way communication between Kafka and ZooKeeper, but as far as partition leader election is concerned, it is not direct two-way communication between every broker and ZooKeeper. The controller acts as a middleman.

In your test, the controller is never notified that broker 2 has failed, so that broker remains the leader of partition 1.

From here on, I am speculating:

Your broker 2, which has its input blocked, cannot receive metadata updates, so it fences itself by shrinking the ISR to itself. This might help as well.
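
One related way to picture the ISR shrink, offered as a sketch rather than a description of the broker's actual code: followers replicate by fetching from the leader, and the leader drops a follower from the ISR when it has not fetched within the window set by the broker setting `replica.lag.time.max.ms`. With INPUT to port 29092 blocked, the followers' fetch requests would stop arriving. The function `shrink_isr`, the 10-second window, and the timestamps below are assumptions chosen for illustration.

```python
# Assumed lag window for this illustration; the real value comes from the
# broker setting replica.lag.time.max.ms.
REPLICA_LAG_TIME_MAX_MS = 10_000


def shrink_isr(leader, isr, last_fetch_ms, now_ms,
               max_lag_ms=REPLICA_LAG_TIME_MAX_MS):
    """Keep the leader plus every follower that fetched within max_lag_ms."""
    return [
        broker for broker in isr
        if broker == leader
        or now_ms - last_fetch_ms.get(broker, float("-inf")) <= max_lag_ms
    ]


# Partition 1: leader 2, ISR 2,3,1. Suppose the last fetches from brokers
# 3 and 1 reached the leader at t=0, just before the iptables rule was added.
last_fetch_ms = {3: 0, 1: 0}

print(shrink_isr(2, [2, 3, 1], last_fetch_ms, now_ms=5_000))   # [2, 3, 1]
print(shrink_isr(2, [2, 3, 1], last_fetch_ms, now_ms=15_000))  # [2]
```

Once the window has passed without a fetch, only the leader is left, which matches the `isrs: 2` entry in the metadata taken after the rule was added.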
