如何确保Kafka集群完全启动? [英] How to ensure a kafka cluster is fully up?

查看:29
本文介绍了如何确保Kafka集群完全启动?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我们有3个Zookeeper在生产中运行着五个节点群集-所有都是VM.为了进行一些硬件修补,我们必须经常重新启动集群.

We have five node cluster running in production with 3 zookeepers - all are VMs. We have to restart the cluster often for some hardware patching.

我们编写了一个ansible脚本,按以下顺序关闭集群,

We have written an ansible script to shutdown the cluster in the following order,

  1. 通过杀死进程来停止Kafka连接(依次连接1个,2个,3个节点)
  2. 使用kafka-server-stop.sh停止Kafka(依次1、2、3、4、5个节点)
  3. 使用zookeeper-server-stop.sh停止Zookeeper(依次1、2、3个节点)

修补后,启动脚本将执行以下操作

After patching, start script will do the following

  1. 使用zookeeper-server-start.sh启动Zookeeper(依次1、2、3个节点)
  2. 使用kafka-server-start.sh启动Kafka(依次1、2、3、4、5个节点)
  3. 使用connect-distributed.sh启动Kafka连接(依次1、2、3个节点)

问题在于启动脚本的#3步骤,在执行#3(启动kafka connect)以使kafka集群完全启动并运行之前,我们保持了大约10分钟的硬编码延迟.但是有时,群集中的某些节点需要更多的时间来启动,因此即使经过了延迟,kafka连接也无法启动-在这种情况下,我们必须等待30分钟,然后尝试再次手动重新启动连接.

The issue is with the #3 step of start script, we have kept a hard coded delay about 10 mins before executing #3 (starting kafka connect) to make kafka cluster is fully up and running. But sometimes, some of the nodes in the cluster take more time to start, hence kafka connect start up fails even after the delay - In this case we have to wait for 30 mins and try restarting the connect manually again.

在启动其他进程之前,有什么方法可以确保集群中的所有节点都已启动并正在运行吗?

Is there any way to make sure that all nodes in the cluster is up and running, before I start the other processes?

预先感谢.

硬编码延迟不起作用,我们无法在某些假设的情况下继续更改延迟

Hard coded delay does not work, we can't keep on changing the delay with some assumption

推荐答案

一旦所有代理都已启动,我们可以使用以下cmd来检查它们是否已形成集群.

Once all brokers have been started we can use following cmds, to check whether they have formed a cluster or not.

  • 在kafka-1中,对其余代理运行以下命令,即i = 2、3、4和5:

  • From kafka-1 run the following command against the rest of the brokers, i.e. i = 2, 3, 4 and 5:

  • nc -vz kafka-i 9092 [它应该返回连接成功]
  • nc -vz kafka-i 9092 [It should return connection succeeded]

tail server.log.它应该提供有关群集的信息.

tail the server.log in each broker node. It should give the info about the cluster.

从Kafka bin目录中,您可以定期运行 ./zookeeper-shell.sh zk_host:zk_port 并执行 ls/brokers/ids .它应该给您五个条目,例如 [0,1,2,3,4] (如果所有5个经纪人都已注册到Zookeeper).

From Kafka bin directory, You can periodically run ./zookeeper-shell.sh zk_host:zk_port and execute ls /brokers/ids. It should gives you five entries, e.g. [0, 1, 2, 3, 4] if all 5 brokers have registered to the zookeeper.

一个肮脏(涉及较少)的黑客可能是创建一个具有5个分区的测试主题,然后等到每个代理将1个分区分配给自己.

One dirty (less involved) hack might be to create a test topic with 5 partitions, and wait until each broker gets 1 partition to itself.

这篇关于如何确保Kafka集群完全启动?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆