Vertx clustered eventbus not removing old node on kubernetes rolling deployment

Problem Description

I have two Vert.x microservices running in a cluster that communicate with each other through a headless service (link) in an on-premise cloud. Whenever I do a rolling deployment I face connectivity issues between the services. When I analysed the logs I could see that the old node/pod was removed from the cluster member list, but the event bus did not remove it and kept using it on a round-robin basis.

Below is the member group information before the deployment:

    Member [192.168.4.54]:5701 - ace32cef-8cb2-4a3b-b15a-2728db068b80        //pod 1
    Member [192.168.4.54]:5705 - f0c39a6d-4834-4b1d-a179-1f0d74cabbce this
    Member [192.168.101.79]:5701 - ac0dcea9-898a-4818-b7e2-e9f8aaefb447      //pod 2

When the deployment is started, pod 2 gets removed from the member list:

[192.168.4.54]:5701 [dev] [4.0.2] Could not connect to: /192.168.101.79:5701. Reason: SocketException[Connection refused to address /192.168.101.79:5701]
    Removing connection to endpoint [192.168.101.79]:5701 Cause => java.net.SocketException {Connection refused to address /192.168.101.79:5701}, Error-Count: 5
    Removing Member [192.168.101.79]:5701 - ac0dcea9-898a-4818-b7e2-e9f8aaefb447

and the new member is added:

Member [192.168.4.54]:5701 - ace32cef-8cb2-4a3b-b15a-2728db068b80
    Member [192.168.4.54]:5705 - f0c39a6d-4834-4b1d-a179-1f0d74cabbce this
    Member [192.168.94.85]:5701 - 1347e755-1b55-45a3-bb9c-70e07a29d55b  //new pod
All migration tasks have been completed. (repartitionTime=Mon May 10 08:54:19 MST 2021, plannedMigrations=358, completedMigrations=358, remainingMigrations=0, totalCompletedMigrations=3348, elapsedMigrationTime=1948ms, totalElapsedMigrationTime=27796ms)

But when a request is made to the deployed service, even though the old pod has been removed from the member group, the event bus is still using the old pod/service reference (ac0dcea9-898a-4818-b7e2-e9f8aaefb447):

[vert.x-eventloop-thread-1] DEBUG io.vertx.core.eventbus.impl.clustered.ConnectionHolder - tx.id=f9f5cfc9-8ad8-4eb1-b12c-322feb0d1acd Not connected to server ac0dcea9-898a-4818-b7e2-e9f8aaefb447 - starting queuing

I checked the official documentation for rolling deployments, and my deployment seems to follow the two key points mentioned there, namely that only one pod is removed and then the new one is added:

never start more than one new pod at once

forbid more than one unavailable pod during the process

I am using Vert.x 4.0.3 and hazelcast-kubernetes 1.2.2. My verticle class extends AbstractVerticle and is deployed using:

    Vertx.clusteredVertx(options, vertx -> {
        vertx.result().deployVerticle(verticleName, deploymentOptions);
    });
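
For context, here is a fuller sketch of how such a clustered startup is typically wired with the Hazelcast cluster manager. The main class, the verticle name, and the error handling are illustrative assumptions, not taken from the question:

    import io.vertx.core.DeploymentOptions;
    import io.vertx.core.Vertx;
    import io.vertx.core.VertxOptions;
    import io.vertx.spi.cluster.hazelcast.HazelcastClusterManager;

    public class Main {
        public static void main(String[] args) {
            // The cluster manager reads its cluster.xml from the classpath,
            // which is where the kubernetes discovery plugin would be configured.
            HazelcastClusterManager clusterManager = new HazelcastClusterManager();
            VertxOptions options = new VertxOptions().setClusterManager(clusterManager);

            Vertx.clusteredVertx(options, res -> {
                if (res.succeeded()) {
                    res.result().deployVerticle("com.example.MyVerticle", // hypothetical verticle
                            new DeploymentOptions().setInstances(1));
                } else {
                    // Fail fast if this node could not join the cluster.
                    res.cause().printStackTrace();
                }
            });
        }
    }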

Sorry for the long post, any help is highly appreciated.

Recommended Answer

One possible reason could be a race condition between Kubernetes removing the pod and updating the endpoints in kube-proxy, as detailed in this extensive article. This race condition leads Kubernetes to continue sending traffic to the removed pod after it has terminated.

One TL;DR solution is to add a delay when terminating a pod by either:

  1. Have the service delay shutdown when it receives a SIGTERM (e.g. for 15 seconds) so that it keeps responding to requests as normal during that delay period (see the sketch after this list).
  2. Use the Kubernetes preStop hook to execute a sleep 15 command on the container. This allows the service to continue responding to requests during the 15 seconds in which Kubernetes updates its endpoints. Kubernetes sends SIGTERM once the preStop hook completes.
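
If you control the service code, the first option can be implemented with a JVM shutdown hook that holds the process open briefly before closing Vert.x. This is a minimal sketch; the 15-second delay, the class name, and the timeouts are assumptions for illustration:

    import io.vertx.core.Vertx;

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;

    public class GracefulShutdown {

        // Install once, right after the clustered Vertx instance has started.
        public static void install(Vertx vertx) {
            Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                try {
                    // Keep answering requests while Kubernetes propagates the
                    // endpoint removal (the delay length is an assumption).
                    Thread.sleep(15_000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                // Then close Vert.x so this node leaves the cluster cleanly
                // and peers stop routing event bus traffic to it.
                CountDownLatch latch = new CountDownLatch(1);
                vertx.close(ar -> latch.countDown());
                try {
                    latch.await(10, TimeUnit.SECONDS);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }));
        }
    }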

Both solutions give Kubernetes some time to propagate changes to its internal components so that traffic stops being routed to the pod being removed.

A caveat to this answer is that I'm not familiar with Hazelcast clustering or with how your specific discovery mode is set up.
