Kubernetes node shutdown/crash recovery?
Question
We have a cluster with a master node (foo-1), and two worker nodes (foo-2 and foo-3). We have a pod that was running on foo-3 (as decided by Kubernetes). We purposely shut down foo-3 as an experiment.
My expectation was that Kubernetes would "see" the shutdown and automatically restart the pod on foo-2. But that didn't seem to happen. In fact, it seemed to think that the pod was still running on foo-3.
After five minutes of waiting, Kubernetes finally recognized that the cluster node had disappeared, and responded gracefully by restarting the pod on foo-2. Five minutes is too long for us, as this is not a replicated application. How can we make that timeout drastically shorter (like, 10 seconds)? And actually, if the host has a graceful shutdown (like for patching), the effect should be immediate.
Recommended answer
There is a --pod-eviction-timeout parameter in kube-controller-manager, which is 5m by default:
--pod-eviction-timeout duration The grace period for deleting pods on failed nodes. (default 5m0s)
You need to lower it if you want to speed up the eviction process.
But if you want to minimize your pod's downtime when a node goes down, you need to modify the following parameters as well:
kubelet: node-status-update-frequency=4s (default 10s)
kube-controller-manager: node-monitor-period=2s (default 5s)
kube-controller-manager: node-monitor-grace-period=16s (default 40s)
kube-controller-manager: pod-eviction-timeout=30s (default 5m)
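To see roughly what these values buy you, here is a minimal Python sketch of the arithmetic. It assumes the delay before a pod is evicted from a dead node is approximately node-monitor-grace-period (time until the node is marked unhealthy) plus pod-eviction-timeout (time until its pods are deleted). The function name is hypothetical, and the estimate ignores scheduling and container start-up time on the new node, so treat it as a ballpark rather than a guarantee.

```python
from datetime import timedelta


def approx_eviction_delay(node_monitor_grace_period: timedelta,
                          pod_eviction_timeout: timedelta) -> timedelta:
    """Rough estimate (an assumption, not an official formula):
    time to mark the node unhealthy plus time to evict its pods."""
    return node_monitor_grace_period + pod_eviction_timeout


# Defaults quoted in the answer: 40s grace period, 5m eviction timeout.
default_delay = approx_eviction_delay(timedelta(seconds=40),
                                      timedelta(minutes=5))

# Tuned values from the list above: 16s grace period, 30s eviction timeout.
tuned_delay = approx_eviction_delay(timedelta(seconds=16),
                                    timedelta(seconds=30))

print("default:", default_delay)  # default: 0:05:40
print("tuned:  ", tuned_delay)    # tuned:   0:00:46
```

Under this assumption the defaults give a bit over five minutes, which matches what the question observed, while the tuned values bring it under a minute. Getting down to 10 seconds would require lowering both settings further.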
And, of course, you can always run your deployments with two replicas, so the service stays up even if one node goes down.