How to ensure a Kafka cluster is fully up?


Question

We have a five-node Kafka cluster running in production with three ZooKeeper nodes; all of them are VMs. We have to restart the cluster often for hardware patching.

We have written an Ansible script that shuts the cluster down in the following order:

  1. Stop Kafka Connect by killing the process (nodes 1, 2, 3 in sequence)
  2. Stop Kafka using kafka-server-stop.sh (nodes 1, 2, 3, 4, 5 in sequence)
  3. Stop ZooKeeper using zookeeper-server-stop.sh (nodes 1, 2, 3 in sequence)

After patching, the start script does the following:

  1. Start ZooKeeper using zookeeper-server-start.sh (nodes 1, 2, 3 in sequence)
  2. Start Kafka using kafka-server-start.sh (nodes 1, 2, 3, 4, 5 in sequence)
  3. Start Kafka Connect using connect-distributed.sh (nodes 1, 2, 3 in sequence)

The issue is with step 3 of the start script. We keep a hard-coded delay of about 10 minutes before executing step 3 (starting Kafka Connect) to make sure the Kafka cluster is fully up and running. But sometimes some of the nodes in the cluster take longer to start, so Kafka Connect fails to start even after the delay. In that case we have to wait 30 minutes and then try restarting Connect manually.

Is there any way to make sure that all nodes in the cluster are up and running before I start the other processes?

Thanks in advance.

A hard-coded delay does not work; we can't keep changing the delay based on guesses.

Recommended answer

Once all brokers have been started, we can use the following commands to check whether they have formed a cluster.

  • From kafka-1, run the following command against each of the other brokers, i.e. i = 2, 3, 4 and 5:

  • nc -vz kafka-i 9092 [it should report that the connection succeeded]
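Instead of a fixed delay, the nc check can be polled until every broker answers. The following is a minimal sketch in Python, not part of the original answer: the function names, the timeout values and the kafka-i host names are assumptions chosen for illustration. Note that an open port only shows that the broker's listener is accepting connections, not that the broker has finished joining the cluster.

```python
import socket
import time


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds (like `nc -vz`)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_brokers(hosts, port=9092, deadline_s=1800.0, interval_s=10.0):
    """Poll until every host accepts connections on `port` or the deadline passes.

    Returns the hosts that are still unreachable; an empty list means all are up.
    """
    end = time.monotonic() + deadline_s
    pending = list(hosts)
    while True:
        # Keep only the hosts that still refuse the connection.
        pending = [h for h in pending if not port_is_open(h, port)]
        if not pending or time.monotonic() >= end:
            return pending
        time.sleep(interval_s)


# Hypothetical usage with the kafka-i naming from the answer:
#   down = wait_for_brokers(["kafka-%d" % i for i in range(1, 6)])
#   if down: abort before starting Kafka Connect
```

A start script could call wait_for_brokers before step 3 and fail fast, listing the unreachable brokers, rather than starting Kafka Connect against a partial cluster.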

Tail the server.log on each broker node. It should give information about the cluster.

From the Kafka bin directory, you can periodically run ./zookeeper-shell.sh zk_host:zk_port and execute ls /brokers/ids. It should give you five entries, e.g. [0, 1, 2, 3, 4], if all five brokers have registered with ZooKeeper.
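The ZooKeeper check can be scripted as well. The sketch below is an illustration only: the function names are hypothetical, and it assumes that zookeeper-shell.sh accepts the command as trailing arguments and prints the id list on a line of its own, as in the [0, 1, 2, 3, 4] example above. The parsing is kept separate from the subprocess call so it can be tried on captured output.

```python
import re
import subprocess
from typing import List


def parse_broker_ids(shell_output: str) -> List[int]:
    """Extract broker ids from `ls /brokers/ids` output such as "[0, 1, 2, 3, 4]"."""
    # Scan from the end: the id list normally follows the connection log lines.
    for line in reversed(shell_output.splitlines()):
        match = re.fullmatch(r"\[([\d,\s]*)\]", line.strip())
        if match:
            return sorted(int(x) for x in match.group(1).split(",") if x.strip())
    return []


def registered_broker_ids(zk: str, shell: str = "./zookeeper-shell.sh") -> List[int]:
    """Run the shell once against zk ("zk_host:zk_port") and return the ids found."""
    result = subprocess.run(
        [shell, zk, "ls", "/brokers/ids"],
        capture_output=True, text=True, timeout=60,
    )
    return parse_broker_ids(result.stdout)


def all_brokers_registered(ids: List[int], expected: int = 5) -> bool:
    """True once the number of registered brokers matches the cluster size."""
    return len(ids) == expected
```

Calling registered_broker_ids in a loop with a short sleep, until all_brokers_registered returns True or a deadline passes, gives the periodic check described above.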

One dirty (less involved) hack might be to create a test topic with 5 partitions and wait until each broker gets one partition to itself.
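For the test-topic hack, one way to decide when to stop waiting is to look at the partition leaders reported by kafka-topics.sh --describe. This sketch assumes the usual "Leader: <id>" fields in that output and only parses text that has already been captured; the function names and the sample output are hypothetical.

```python
import re
from typing import Set


def partition_leaders(describe_output: str) -> Set[int]:
    """Collect the distinct leader broker ids from `--describe` text."""
    leaders = set()
    for value in re.findall(r"Leader:\s*(-?\d+)", describe_output):
        broker_id = int(value)
        if broker_id >= 0:  # -1 means the partition currently has no leader
            leaders.add(broker_id)
    return leaders


def every_broker_leads(describe_output: str, broker_count: int = 5) -> bool:
    """True when the partitions are led by `broker_count` distinct brokers."""
    return len(partition_leaders(describe_output)) == broker_count
```

With 5 partitions on 5 brokers, every_broker_leads returns True only when each broker leads exactly one partition of the test topic.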
