Whole cluster failing if one kafka node goes down?


Problem description

I have a 3-node kafka cluster, with zookeeper and kafka running on each node. If I explicitly kill the leader node (both zookeeper and kafka), the whole cluster stops accepting any incoming data and waits for that node to come back.

kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 3 --config min.insync.replicas=2 --partitions 6 --topic logs

The topic was created using the above command.
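With replication-factor 3 and min.insync.replicas=2, the topic should in principle keep accepting writes (with acks=all) after a single broker failure, since two in-sync replicas remain. A minimal sketch of that arithmetic (a hypothetical helper for illustration, not Kafka code):

```python
def can_accept_writes(replication_factor: int, min_insync_replicas: int,
                      failed_brokers: int) -> bool:
    """Return True if a partition produced to with acks=all can still accept writes.

    Best case: assumes every surviving replica stays in the ISR; real
    clusters can shrink the ISR for other reasons (lag, network issues).
    """
    surviving_replicas = replication_factor - failed_brokers
    return surviving_replicas >= min_insync_replicas

# The setup from the question: RF=3, min.insync.replicas=2.
print(can_accept_writes(3, 2, failed_brokers=0))  # True
print(can_accept_writes(3, 2, failed_brokers=1))  # True: one node down is tolerated
print(can_accept_writes(3, 2, failed_brokers=2))  # False: not enough in-sync replicas
```

So the cluster going fully unavailable after one node failure points at something other than the topic settings.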

Node 1

server.properties

broker.id=0
listeners=PLAINTEXT://:9092
advertised.listeners=PLAINTEXT://10.0.2.4:9092
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=/tmp/kafka-logs
num.partitions=1
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=localhost:2181,10.0.2.5:2181,10.0.14.7:2181
zookeeper.connection.timeout.ms=18000
group.initial.rebalance.delay.ms=0

zookeeper.properties

tickTime=2000 
dataDir=/tmp/zookeeper/ 
initLimit=5 
syncLimit=2 
server.0=0.0.0.0:2888:3888
server.1=analyzer1:2888:3888
server.2=10.0.14.4:2888:3888
clientPort=2181

The kafka and zookeeper configuration for each node follows the same format as above.
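A 3-server zookeeper ensemble keeps quorum as long as a strict majority of servers is up, so killing one of the three servers should still leave a working ensemble able to elect a new leader. A quick sketch of the majority rule, assuming a plain majority quorum (no hierarchical groups):

```python
def has_quorum(ensemble_size: int, alive: int) -> bool:
    """ZooKeeper serves requests only while a strict majority of the ensemble is up."""
    return alive > ensemble_size // 2

print(has_quorum(3, 3))  # True
print(has_quorum(3, 2))  # True: a 3-node ensemble tolerates one failure
print(has_quorum(3, 1))  # False: quorum lost
```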

When I check the zookeeper status of the remaining nodes I can see a new leader, but the producer still fails to send data, and the two remaining kafka nodes keep logging the error below.

WARN Client session timed out, have not heard from server in 30004ms for sessionid 0x0 (org.apache.zookeeper.ClientCnxn)

Can someone help me with this?

Kafka logs from one of the available nodes, in case they help:

[2020-10-08 19:40:13,607] WARN Client session timed out, have not heard from server in 12002ms for sessionid 0x2acefe00000 (org.apache.zookeeper.ClientCnxn)
[2020-10-08 19:40:13,608] INFO Client session timed out, have not heard from server in 12002ms for sessionid 0x2acefe00000, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2020-10-08 19:40:13,709] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient)
[2020-10-08 19:40:13,709] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient)
[2020-10-08 19:40:13,709] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient)
[2020-10-08 19:40:13,709] INFO [ZooKeeperClient Kafka server] Connected. (kafka.zookeeper.ZooKeeperClient)
[2020-10-08 19:40:13,866] INFO Opening socket connection to server 10.0.14.7/10.0.14.7:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2020-10-08 19:40:13,867] INFO Socket error occurred: 10.0.14.7/10.0.14.7:2181: Connection refused (org.apache.zookeeper.ClientCnxn)
[2020-10-08 19:40:13,968] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient)
[2020-10-08 19:40:13,968] INFO [ZooKeeperClient Kafka server] Waiting until connected. (kafka.zookeeper.ZooKeeperClient)
[2020-10-08 19:40:14,093] WARN [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Connection to node 1 (/10.0.2.5:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2020-10-08 19:40:14,093] INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Error sending fetch request (sessionId=205463854, epoch=INITIAL) to node 1: {}. (org.apache.kafka.clients.FetchSessionHandler)
java.io.IOException: Connection to 10.0.2.5:9092 (id: 1 rack: null) failed.
        at org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:71)
        at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:103)
        at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:206)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:300)
        at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:135)
        at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:134)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:117)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)

Answer

This typically happens when you only list one of the brokers in the KafkaProducer property bootstrap_servers. My assumption is that this property is set only to the broker 10.0.2.5:9092 instead of listing all three nodes.

Although mentioning only one of them is sufficient to communicate with the entire cluster, it is recommended to list at least two broker addresses (as a comma-separated list) to deal with exactly the scenario you are facing: if the single listed broker is down, the producer cannot bootstrap at all.
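For example, using the addresses that appear in the question (assuming the third broker is indeed reachable at 10.0.14.7:9092, which the server.properties above does not confirm), the producer configuration would look something like this rather than a single address:

```properties
# List several brokers so the initial bootstrap survives any single broker failure.
# The client only needs one of these to be reachable to discover the whole cluster.
bootstrap.servers=10.0.2.4:9092,10.0.2.5:9092,10.0.14.7:9092
```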

In case of an individual broker failure, the cluster switches partition leadership to the surviving brokers, as you have seen in the logs. Even though you do not list all of the brokers in bootstrap_servers, the producer uses the listed addresses only for the initial metadata fetch and then figures out on its own which broker (the partition leader) it needs to send the data to.

