Kubernetes node shutdown/crash recovery?

Question

We have a cluster with a master node (foo-1), and two worker nodes (foo-2 and foo-3). We have a pod that was running on foo-3 (as decided by Kubernetes). We purposely shut down foo-3 as an experiment.

My expectation was that Kubernetes would "see" the shutdown and automatically restart the pod on foo-2. But that didn't seem to happen. In fact, it seemed to think that the pod was still running on foo-3.

After five minutes of waiting, Kubernetes finally recognized that the cluster node had disappeared, and responded gracefully by restarting the pod on foo-2. Five minutes is too long for us, as this is not a replicated application. How can we make that timeout drastically shorter (like, 10 seconds)? And actually, if the host is shut down gracefully (for patching, say), the failover should be immediate.

Recommended answer

kube-controller-manager has a --pod-eviction-timeout parameter, which defaults to 5m:

 --pod-eviction-timeout duration    The grace period for deleting pods on failed nodes. (default 5m0s)

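You need to lower it if you want to speed up the eviction process. Note that on clusters using taint-based evictions (the default in recent Kubernetes releases), this flag is ignored; there, the tolerationSeconds that each pod sets for the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints (300 seconds by default) controls the delay instead.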

But if you want to minimize your pod's downtime when a node goes down, you need to modify the following parameters as well:

kubelet: node-status-update-frequency=4s (default 10s)

kube-controller-manager: node-monitor-period=2s (default 5s)
kube-controller-manager: node-monitor-grace-period=16s (default 40s)
kube-controller-manager: pod-eviction-timeout=30s (default 5m)
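
As a rough sanity check on these numbers, here is a minimal Python sketch. It assumes the commonly cited approximation that a failed node is marked NotReady once node-monitor-grace-period elapses and that its pods are evicted pod-eviction-timeout after that. The class and method names are hypothetical, and the estimate ignores scheduling and container start-up time on the new node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeFailureTimings:
    """Timing settings from the list above, all in seconds (hypothetical helper)."""

    node_status_update_frequency: float  # kubelet
    node_monitor_period: float           # kube-controller-manager
    node_monitor_grace_period: float     # kube-controller-manager
    pod_eviction_timeout: float          # kube-controller-manager

    def seconds_until_eviction(self) -> float:
        # Assumption: node is marked NotReady after the grace period,
        # then pods are evicted once the eviction timeout has passed.
        return self.node_monitor_grace_period + self.pod_eviction_timeout

    def status_updates_per_grace_period(self) -> float:
        # How many kubelet status updates fit into one grace period.
        return self.node_monitor_grace_period / self.node_status_update_frequency


DEFAULTS = NodeFailureTimings(10, 5, 40, 300)
TUNED = NodeFailureTimings(4, 2, 16, 30)

if __name__ == "__main__":
    for name, timings in (("defaults", DEFAULTS), ("tuned", TUNED)):
        print(
            f"{name}: eviction starts after about "
            f"{timings.seconds_until_eviction():.0f}s"
        )
```

Under this approximation the tuned values cut the delay from roughly 340 seconds to roughly 46 seconds, while keeping the same ratio between the grace period and the kubelet update frequency as the defaults.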

And, of course, you can always run your deployments with 2 replicas, so the service stays up even if one node goes down.
