What happens when the Kubernetes master fails?


Problem description

I've been trying to figure out what happens when the Kubernetes master fails in a cluster that only has one master. Do web requests still get routed to pods if this happens, or does the entire system just shut down?

According to the OpenShift 3 documentation, which is built on top of Kubernetes (https://docs.openshift.com/enterprise/3.2/architecture/infrastructure_components/kubernetes_infrastructure.html), if a master fails, nodes continue to function properly, but the system loses its ability to manage pods. Is this the same for vanilla Kubernetes?

Recommended answer

In typical setups, the master nodes run both the API and etcd and are either largely or fully responsible for managing the underlying cloud infrastructure. When they are offline or degraded, the API will be offline or degraded.
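
If you want to watch this happen, a rough way is to poll the kube-apiserver's health endpoints and see when they stop answering. The sketch below is only illustrative: the API server address and bearer token are placeholders, and depending on how the cluster is configured these endpoints may or may not allow the request.

```python
# Minimal sketch: poll the kube-apiserver health endpoints to watch the
# control plane go offline or degrade. API_SERVER and TOKEN are assumed
# placeholders; substitute values from your own kubeconfig.
import time
import requests

API_SERVER = "https://10.0.0.10:6443"   # hypothetical master address
TOKEN = "..."                            # hypothetical bearer token

headers = {"Authorization": f"Bearer {TOKEN}"}

while True:
    for endpoint in ("/livez", "/readyz", "/healthz"):
        try:
            # verify=False only because this is a throwaway diagnostic sketch;
            # point `verify` at the cluster CA bundle for real use.
            r = requests.get(API_SERVER + endpoint, headers=headers,
                             verify=False, timeout=2)
            print(f"{endpoint}: HTTP {r.status_code} {r.text[:40]!r}")
        except requests.RequestException as exc:
            print(f"{endpoint}: unreachable ({exc.__class__.__name__})")
    time.sleep(5)
```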

In the event that they, etcd, or the API are fully offline, the cluster ceases to be a cluster and is instead a bunch of ad-hoc nodes for this period. The cluster will not be able to respond to node failures, create new resources, move pods to new nodes, etc., until both:

  1. Enough etcd instances come back online to form a quorum and make progress (see this page for an intuitive explanation of how this works and what these terms mean; the quorum arithmetic is spelled out in the sketch just after this list).
  2. At least one API server can serve requests.
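
For reference, the quorum requirement in point 1 is simple majority arithmetic: a cluster of n etcd members needs floor(n/2) + 1 healthy members to make progress. The snippet below just spells that out for the common cluster sizes.

```python
# Quorum arithmetic for an etcd cluster of n members.
def etcd_quorum(members: int) -> int:
    return members // 2 + 1

for n in (1, 3, 5):
    q = etcd_quorum(n)
    print(f"{n} etcd member(s): quorum = {q}, tolerates {n - q} failure(s)")
# 1 member:  quorum 1, tolerates 0 failures (single-master setups)
# 3 members: quorum 2, tolerates 1 failure
# 5 members: quorum 3, tolerates 2 failures
```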

In a partially degraded state, the API server may be able to respond to requests that only read data.
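
One way to observe this distinction is to issue a read and a write against the API while the control plane is degraded. The sketch below assumes the official `kubernetes` Python client and a working kubeconfig; the namespace name used for the write is just an illustrative placeholder.

```python
# Minimal sketch: compare a read (list pods) against a write (create a
# namespace) while the control plane is degraded. Reads may still succeed
# when writes fail because etcd can no longer commit.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

try:
    pods = v1.list_pod_for_all_namespaces(timeout_seconds=5)
    print(f"read OK: {len(pods.items)} pods visible")
except Exception as exc:  # ApiException for HTTP errors, urllib3 errors if unreachable
    print(f"read failed: {exc.__class__.__name__}")

try:
    ns = client.V1Namespace(metadata=client.V1ObjectMeta(name="degradation-probe"))
    v1.create_namespace(ns)
    print("write OK: namespace created")
except Exception as exc:
    print(f"write failed: {exc.__class__.__name__}")
```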

However, in any case, life for applications will continue as normal unless nodes are rebooted, or there is a dramatic failure of some sort during this time, because TCP/UDP services, load balancers, DNS, the dashboard, etc. should all continue to function for at least some time. Eventually, these things will all fail on different timescales. In single-master setups or a complete API failure, DNS failure will probably happen first as caches expire (on the order of minutes, though the exact timing is configurable; see the coredns cache plugin documentation). This is a good reason to consider a multi-master setup: DNS and service routing can continue to function indefinitely in a degraded state, even if etcd can no longer make progress.
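
If you want to measure this on your own cluster, a simple probe run from inside a pod can show roughly when DNS answers stop coming back after the control plane goes down. The sketch below assumes the default cluster.local domain; substitute any service FQDN you actually run.

```python
# Minimal sketch: repeatedly resolve an in-cluster service name and log when
# resolution starts failing, to see how long cached DNS survives a master
# outage. Run it from inside a pod on the cluster.
import socket
import time

NAME = "kubernetes.default.svc.cluster.local"  # assumed default cluster domain

while True:
    try:
        addr = socket.gethostbyname(NAME)
        print(f"{time.strftime('%H:%M:%S')} {NAME} -> {addr}")
    except socket.gaierror:
        print(f"{time.strftime('%H:%M:%S')} {NAME} -> resolution failed")
    time.sleep(30)
```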

There are actions that you could take as an operator which would accelerate failures, especially in a fully degraded state. For instance, rebooting a node would break DNS queries, and in fact probably all pod and service networking functionality, until at least one master comes back online. Restarting the DNS pods or kube-proxy would also be bad.

If you'd like to test this out yourself, I recommend kubeadm-dind-cluster, kind, or, for more exotic setups, kubeadm on VMs or bare metal. Note: kubectl proxy will not work during an API failure, as it routes traffic through the master(s).
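
As a rough illustration of such a test with kind (assuming the default cluster name, so the control-plane node runs in a Docker container named kind-control-plane), you could stop that container to simulate a master failure and watch what keeps working while the API is gone:

```python
# Minimal sketch, assuming a kind cluster created with the default name "kind":
# stop the control-plane container to simulate a master failure, observe that
# the API is unreachable, then bring it back. Workload pods on worker nodes
# keep running while the container is stopped.
import subprocess
import time

def run(*cmd):
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=False)

run("docker", "stop", "kind-control-plane")   # simulate master failure
run("kubectl", "get", "nodes")                # should fail: connection refused / timeout
time.sleep(60)                                # existing services keep serving meanwhile
run("docker", "start", "kind-control-plane")  # restore the control plane
time.sleep(30)                                # give the API a moment to come back up
run("kubectl", "get", "nodes")                # should work again
```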
