Kafka partitions out of sync on certain nodes

Problem Description

I'm running a Kafka cluster on 3 EC2 instances. Each instance runs kafka (0.11.0.1) and zookeeper (3.4). My topics are configured so that each has 20 partitions and ReplicationFactor of 3.
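(For context, the question does not show how the topics were created; a topic with this layout would typically come from a command along these lines, a minimal sketch reusing the ZooKeeper connection string shown below:)

bin/kafka-topics.sh --zookeeper "10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181" \
    --create --topic prod-decline --partitions 20 --replication-factor 3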

Today I noticed that some partitions refuse to sync to all three nodes. Here's an example:

bin/kafka-topics.sh --zookeeper "10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181" --describe --topic prod-decline
Topic:prod-decline    PartitionCount:20    ReplicationFactor:3    Configs:
    Topic: prod-decline    Partition: 0    Leader: 2    Replicas: 1,2,0    Isr: 2
    Topic: prod-decline    Partition: 1    Leader: 2    Replicas: 2,0,1    Isr: 2
    Topic: prod-decline    Partition: 2    Leader: 0    Replicas: 0,1,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 3    Leader: 1    Replicas: 1,0,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 4    Leader: 2    Replicas: 2,1,0    Isr: 2
    Topic: prod-decline    Partition: 5    Leader: 2    Replicas: 0,2,1    Isr: 2
    Topic: prod-decline    Partition: 6    Leader: 2    Replicas: 1,2,0    Isr: 2
    Topic: prod-decline    Partition: 7    Leader: 2    Replicas: 2,0,1    Isr: 2
    Topic: prod-decline    Partition: 8    Leader: 0    Replicas: 0,1,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 9    Leader: 1    Replicas: 1,0,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 10    Leader: 2    Replicas: 2,1,0    Isr: 2
    Topic: prod-decline    Partition: 11    Leader: 2    Replicas: 0,2,1    Isr: 2
    Topic: prod-decline    Partition: 12    Leader: 2    Replicas: 1,2,0    Isr: 2
    Topic: prod-decline    Partition: 13    Leader: 2    Replicas: 2,0,1    Isr: 2
    Topic: prod-decline    Partition: 14    Leader: 0    Replicas: 0,1,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 15    Leader: 1    Replicas: 1,0,2    Isr: 2,0,1
    Topic: prod-decline    Partition: 16    Leader: 2    Replicas: 2,1,0    Isr: 2
    Topic: prod-decline    Partition: 17    Leader: 2    Replicas: 0,2,1    Isr: 2
    Topic: prod-decline    Partition: 18    Leader: 2    Replicas: 1,2,0    Isr: 2
    Topic: prod-decline    Partition: 19    Leader: 2    Replicas: 2,0,1    Isr: 2
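(Side note: the same tool can also list only the problematic partitions directly; a minimal sketch, assuming the same ZooKeeper connection string:)

bin/kafka-topics.sh --zookeeper "10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181" \
    --describe --under-replicated-partitions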

Only node 2 has all the data in-sync. I've tried restarting brokers 0 and 1 but it didn't improve the situation - it made it even worse. I'm tempted to restart node 2 but I'm assuming it will lead to downtime or cluster failure so I'd like to avoid it if possible.

I'm not seeing any obvious errors in logs so I'm having a hard time figuring out how to debug the situation. Any tips would be greatly appreciated.

Thanks!

Some additional info: if I check the metrics on node 2 (the one with full data), it does recognize that some partitions are not correctly replicated:

$>get -d kafka.server -b kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions *
#mbean = kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions:
Value = 930;

Nodes 0 and 1 don't. They seem to think everything is fine:

$>get -d kafka.server -b kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions *
#mbean = kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions:
Value = 0;

Is this expected behavior?

Recommended Answer

Try increasing replica.lag.time.max.ms.

The explanation goes like this:

If a replica fails to send a fetch request for longer than replica.lag.time.max.ms, it is considered dead and is removed from the ISR.

If a replica starts lagging behind the leader for longer than replica.lag.time.max.ms, then it is considered too slow and is removed from the ISR. So even if there is a spike in traffic and large batches of messages are written on the leader, unless the replica consistently remains behind the leader for replica.lag.time.max.ms, it will not shuffle in and out of the ISR.
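For reference, replica.lag.time.max.ms is a broker-level setting (default 10000 ms in Kafka 0.11), so the change goes into each broker's server.properties and takes effect after a broker restart. A minimal sketch; the 30000 ms value below is only an illustrative bump, not a recommendation:

# config/server.properties on every broker
# Default in Kafka 0.11 is 10000 ms; raising it gives slow followers more time
# to catch up before they are dropped from the ISR. 30000 is just an example value.
replica.lag.time.max.ms=30000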
