Kafka new producer is not able to update metadata after one of the brokers is down

Problem description

I have a Kafka environment with 2 brokers and 1 ZooKeeper.

While I am trying to produce messages to Kafka, if I stop broker 1 (the leader), the client stops producing messages and gives me the error below, even though broker 2 has been elected as the new leader for the topic and its partitions.

org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.

After 10 minutes had passed, since broker 2 is the new leader, I expected the producer to send data to broker 2, but it kept failing with the exception above. lastRefreshMs and lastSuccessfullRefreshMs are still the same, although metadataExpireMs is 300000 for the producer.

I am using the new Kafka producer implementation on the producer side.

It seems that when the producer is initiated, it binds to one broker, and if that broker goes down it does not even try to connect to the other brokers in the cluster.

But my expectation is that if a broker goes down, the producer should fetch metadata from the other available brokers and send the data to them.

Btw, my topic has 4 partitions and a replication factor of 2. Giving this info in case it is relevant.

Configuration parameters:

request.timeout.ms=30000
retry.backoff.ms=100
buffer.memory=33554432
ssl.truststore.password=null
batch.size=16384
ssl.keymanager.algorithm=SunX509
receive.buffer.bytes=32768
ssl.cipher.suites=null
ssl.key.password=null
sasl.kerberos.ticket.renew.jitter=0.05
ssl.provider=null
sasl.kerberos.service.name=null
max.in.flight.requests.per.connection=5
sasl.kerberos.ticket.renew.window.factor=0.8
bootstrap.servers=[10.201.83.166:9500, 10.201.83.167:9500]
client.id=rest-interface
max.request.size=1048576
acks=1
linger.ms=0
sasl.kerberos.kinit.cmd=/usr/bin/kinit
ssl.enabled.protocols=[TLSv1.2, TLSv1.1, TLSv1]
metadata.fetch.timeout.ms=60000
ssl.endpoint.identification.algorithm=null
ssl.keystore.location=null
value.serializer=class org.apache.kafka.common.serialization.ByteArraySerializer
ssl.truststore.location=null
ssl.keystore.password=null
key.serializer=class org.apache.kafka.common.serialization.ByteArraySerializer
block.on.buffer.full=false
metrics.sample.window.ms=30000
metadata.max.age.ms=300000
security.protocol=PLAINTEXT
ssl.protocol=TLS
sasl.kerberos.min.time.before.relogin=60000
timeout.ms=30000
connections.max.idle.ms=540000
ssl.trustmanager.algorithm=PKIX
metric.reporters=[]
compression.type=none
ssl.truststore.type=JKS
max.block.ms=60000
retries=0
send.buffer.bytes=131072
partitioner.class=class org.apache.kafka.clients.producer.internals.DefaultPartitioner
reconnect.backoff.ms=50
metrics.num.samples=2
ssl.keystore.type=JKS
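For context, a minimal sketch of how a producer with the settings above might be constructed. The class name and topic name are hypothetical, and only the settings relevant to this problem are reproduced:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerSketch {

    static KafkaProducer<byte[], byte[]> createProducer() {
        Properties props = new Properties();
        // Both brokers are listed, so bootstrapping has a fallback if one is down.
        props.put("bootstrap.servers", "10.201.83.166:9500,10.201.83.167:9500");
        props.put("acks", "1");
        props.put("retries", "0");
        props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        // The metadataExpireMs mentioned above: refresh metadata at most every 5 minutes.
        props.put("metadata.max.age.ms", "300000");
        // The 60000 ms in the TimeoutException is this metadata fetch timeout.
        props.put("metadata.fetch.timeout.ms", "60000");
        return new KafkaProducer<byte[], byte[]>(props);
    }

    public static void main(String[] args) {
        KafkaProducer<byte[], byte[]> producer = createProducer();
        // "test-topic" is a placeholder; the real topic has 4 partitions, replication factor 2.
        producer.send(new ProducerRecord<byte[], byte[]>("test-topic", "hello".getBytes()));
        producer.close();
    }
}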

Use case:

1- Start BR1 and BR2, produce data (leader is BR1)

2- Stop BR2, produce data (fine)

3- Stop BR1 (which means there is no active broker in the cluster at this point), then start BR2 and produce data (fails, although the leader is BR2)

4- Start BR1, produce data (leader is still BR2, but data is produced fine)

5- Stop BR2 (now BR1 is the leader)

6- Stop BR1 (BR1 is still the leader)

7- Start BR1, produce data (messages are produced fine again)
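To make the failure in step 3 visible, each send can be given a callback; a sketch, reusing the hypothetical createProducer() and topic name from above:

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class FailoverProbe {
    public static void main(String[] args) {
        KafkaProducer<byte[], byte[]> producer = ProducerSketch.createProducer();
        producer.send(new ProducerRecord<byte[], byte[]>("test-topic", "ping".getBytes()),
                new Callback() {
                    @Override
                    public void onCompletion(RecordMetadata metadata, Exception e) {
                        if (e != null) {
                            // In step 3 this reports:
                            // TimeoutException: Failed to update metadata after 60000 ms.
                            System.err.println("send failed: " + e);
                        } else {
                            System.out.println("sent to partition " + metadata.partition()
                                    + " at offset " + metadata.offset());
                        }
                    }
                });
        producer.close();
    }
}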

The producer sent the last successful data to BR1, and then all brokers went down; the producer now expects BR1 to come up again, even though BR2 is up and is the new leader. Is this the expected behaviour?

Answer

After spending hours on this, I figured out how Kafka behaves in my situation. Maybe it is a bug, or maybe it has to be done this way for reasons that lie under the hood, but if I were writing such an implementation I would not do it this way :)

When all brokers go down, if you can bring only one broker back up, it must be the broker that went down last in order to produce messages successfully.

Let's say you have 5 brokers: BR1, BR2, BR3, BR4 and BR5. If everything goes down and the broker that died last is BR3 (which was the last leader), then even if you start all of the brokers BR1, BR2, BR4 and BR5, it makes no difference unless you start BR3.
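One way to see this from the client side is to ask the producer for its current view of the metadata; a sketch, again assuming the hypothetical createProducer() and topic name from the question:

import java.util.List;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.PartitionInfo;

public class LeaderProbe {
    public static void main(String[] args) {
        KafkaProducer<byte[], byte[]> producer = ProducerSketch.createProducer();
        // partitionsFor() forces a metadata fetch; while no usable broker is up,
        // it can fail with the same "Failed to update metadata" TimeoutException.
        List<PartitionInfo> partitions = producer.partitionsFor("test-topic");
        for (PartitionInfo p : partitions) {
            // leader() is the broker the producer currently believes owns the partition.
            System.out.println("partition " + p.partition() + " -> leader " + p.leader());
        }
        producer.close();
    }
}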
