Whole cluster failing if one Kafka node goes down?


Question

I have a 3-node Kafka cluster, with each node running both ZooKeeper and Kafka. If I explicitly kill the leader node (both ZooKeeper and Kafka), the whole cluster stops accepting incoming data and waits for that node to come back.

kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 3 min.insync.replicas=2 --partitions 6 --topic logs

The topic was created using the above command.

Node 1

server.properties


broker.id=0
listeners=PLAINTEXT://:9092
advertised.listeners=PLAINTEXT://10.0.2.4:9092
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=/tmp/kafka-logs
num.partitions=1
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=localhost:2181,10.0.2.5:2181,10.0.14.7:2181
zookeeper.connection.timeout.ms=18000
group.initial.rebalance.delay.ms=0

zookeeper.properties


tickTime=2000 
dataDir=/tmp/zookeeper/ 
initLimit=5 
syncLimit=2 
server.0=0.0.0.0:2888:3888
server.1=analyzer1:2888:3888
server.2=10.0.14.4:2888:3888
clientPort=2181

The Kafka and ZooKeeper configuration on each of the other nodes follows the same format as above.

When I check the ZooKeeper status on the remaining nodes, I can see that a new leader has been elected. But the producer still fails to send data. The two remaining Kafka nodes also stop responding, with the error below.

WARN Client session timed out, have not heard from server in 30004ms for sessionid 0x0 (org.apache.zookeeper.ClientCnxn)

Can anyone help me with this?

Here are the Kafka logs from one of the available nodes, in case they help:

[2020-10-08 19:40:13,607] WARN Client session timed out, have not heard from server in 12002ms for sessionid 0x2acefe00000 (org.apache.zookeeper.ClientCnxn)
[2020-10-08 19:40:13,608] INFO Client session timed out, have not heard from server in 12002ms for sessionid 0x2acefe00000, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2020-10-08 19:40:13,709] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient)
[2020-10-08 19:40:13,709] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient)
[2020-10-08 19:40:13,709] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient)
[2020-10-08 19:40:13,709] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient)
[2020-10-08 19:40:13,866] INFO Opening socket connection to server 10.0.14.7/10.0.14.7:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2020-10-08 19:40:13,867] INFO Socket error occurred: 10.0.14.7/10.0.14.7:2181: Connection refused (org.apache.zookeeper.ClientCnxn)
[2020-10-08 19:40:13,968] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient)
[2020-10-08 19:40:13,968] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient)
[2020-10-08 19:40:14,093] WARN [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Connection to node 1 (/10.0.2.5:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2020-10-08 19:40:14,093] INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Error sending fetch request (sessionId=205463854, epoch=INITIAL) to node 1: {}. (org.apache.kafka.clients.FetchSessionHandler)
java.io.IOException: Connection to 10.0.2.5:9092 (id: 1 rack: null) failed.
        at org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:71)
        at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:103)
        at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:206)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:300)
        at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:135)
        at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:134)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:117)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)

Recommended answer

This typically happens when you list only one of the brokers in the KafkaProducer property bootstrap_servers. My assumption is that this property is set only to the broker 10.0.2.5:9092 instead of listing all three nodes.

Although listing only one broker is sufficient to communicate with the entire cluster, it is recommended to list at least two broker addresses (as a comma-separated list) to deal with scenarios like the one you are facing.
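As a minimal sketch of such a comma-separated list, assuming a Python client whose producer takes a bootstrap_servers setting (kafka-python's KafkaProducer does): the helper name build_bootstrap_servers is hypothetical, and the three broker hosts are an assumption pieced together from the advertised.listeners and zookeeper.connect values in the question, so substitute your real broker addresses.

```python
def build_bootstrap_servers(hosts, port=9092, minimum=2):
    """Return a comma-separated bootstrap list, refusing fewer than `minimum` brokers.

    Hypothetical helper for illustration; it is not part of any Kafka client.
    """
    # Drop blanks and duplicates while keeping the original order.
    unique = list(dict.fromkeys(h.strip() for h in hosts if h.strip()))
    if len(unique) < minimum:
        raise ValueError(
            f"list at least {minimum} brokers so one failure does not block bootstrapping"
        )
    return ",".join(f"{host}:{port}" for host in unique)


# Assumed broker hosts; replace with the advertised.listeners hosts of your brokers.
BROKER_HOSTS = ["10.0.2.4", "10.0.2.5", "10.0.14.7"]

bootstrap_servers = build_bootstrap_servers(BROKER_HOSTS)
print(bootstrap_servers)

# With kafka-python installed and the cluster reachable, the value could be used as:
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers=bootstrap_servers.split(","))
# producer.send("logs", b"hello")
```

The check on the minimum number of brokers is only a guard against the single-address setup assumed above; the client itself does not enforce it.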

In case of an individual broker failure, the cluster might switch partition leadership to the remaining active brokers, as you have seen in the logs. Even if you do not list all of the brokers in the bootstrap_servers property, the producer will figure out which broker (the partition leader) it needs to send the data to.
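The idea can be illustrated with a toy model. This is only a sketch under simplifying assumptions: ToyCluster and fetch_leaders are hypothetical names, not Kafka client APIs, and "the first live replica becomes leader" stands in for Kafka's real leader election. It shows that any reachable bootstrap broker can tell the producer who the current partition leaders are, and that a bootstrap list containing only the dead broker cannot.

```python
# Toy model only: these names are hypothetical and are not Kafka client APIs.
class ToyCluster:
    def __init__(self, replicas):
        # replicas: partition id -> ordered list of broker ids holding a copy
        self.replicas = replicas
        self.alive = {b for ids in replicas.values() for b in ids}

    def kill(self, broker_id):
        self.alive.discard(broker_id)

    def metadata(self, asked_broker):
        """Any live broker can report the current leader of every partition."""
        if asked_broker not in self.alive:
            raise ConnectionError(f"broker {asked_broker} is down")
        # Simplification: the first live replica acts as the partition leader.
        return {
            partition: next((b for b in ids if b in self.alive), None)
            for partition, ids in self.replicas.items()
        }


def fetch_leaders(cluster, bootstrap):
    """Try each bootstrap broker in turn until one returns metadata."""
    for broker_id in bootstrap:
        try:
            return cluster.metadata(broker_id)
        except ConnectionError:
            continue
    raise ConnectionError("no bootstrap broker reachable")


cluster = ToyCluster({0: [1, 0, 2], 1: [0, 2, 1], 2: [2, 1, 0]})
cluster.kill(1)  # broker 1 goes down, as in the question's logs

# Two bootstrap brokers: the dead one is skipped and broker 0 answers.
leaders = fetch_leaders(cluster, bootstrap=[1, 0])
print(leaders)  # {0: 0, 1: 0, 2: 2}
```

Calling fetch_leaders(cluster, bootstrap=[1]) in the same state raises ConnectionError, which mirrors the single-address setup assumed in this answer.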
