Accidentally drained all nodes in Kubernetes (even master). How can I bring my Kubernetes back?

Problem description

I accidentally drained all nodes in Kubernetes (even master). How can I bring my Kubernetes back? kubectl is not working anymore:

kubectl get nodes

Result:

The connection to the server 172.16.16.111:6443 was refused - did you specify the right host or port?
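
To confirm that nothing is actually listening on that port on the master (rather than this being a kubeconfig issue), a quick check could look like the following; the address is the one from the error message:

ss -tlnp | grep 6443
curl -k https://172.16.16.111:6443/healthz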

Here is the output of systemctl status kubelet on the master node (node1):

● kubelet.service - Kubernetes Kubelet Server
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2020-06-23 21:42:39 UTC; 25min ago
     Docs: https://github.com/GoogleCloudPlatform/kubernetes
 Main PID: 15541 (kubelet)
    Tasks: 0 (limit: 4915)
   CGroup: /system.slice/kubelet.service
           └─15541 /usr/local/bin/kubelet --logtostderr=true --v=2 --node-ip=172.16.16.111 --hostname-override=node1 --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --config=/etc/kubernetes/kubelet-config.yaml --kubeconfig=/etc/kubernetes/kubelet.conf --pod-infra-container-image=gcr.io/google_containers/pause-amd64:3.1 --runtime-cgroups=/systemd/system.slice --cpu-manager-policy=static --kube-reserved=cpu=1,memory=2Gi,ephemeral-storage=1Gi --system-reserved=cpu=1,memory=2Gi,ephemeral-storage=1Gi --network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin

Jun 23 22:08:34 node1 kubelet[15541]: I0623 22:08:34.330009   15541 kubelet_node_status.go:286] Setting node annotation to enable volume controller attach/detach
Jun 23 22:08:34 node1 kubelet[15541]: I0623 22:08:34.330201   15541 setters.go:73] Using node IP: "172.16.16.111"
Jun 23 22:08:34 node1 kubelet[15541]: I0623 22:08:34.331475   15541 kubelet_node_status.go:472] Recording NodeHasSufficientMemory event message for node node1
Jun 23 22:08:34 node1 kubelet[15541]: I0623 22:08:34.331494   15541 kubelet_node_status.go:472] Recording NodeHasNoDiskPressure event message for node node1
Jun 23 22:08:34 node1 kubelet[15541]: I0623 22:08:34.331500   15541 kubelet_node_status.go:472] Recording NodeHasSufficientPID event message for node node1
Jun 23 22:08:34 node1 kubelet[15541]: I0623 22:08:34.331661   15541 policy_static.go:244] [cpumanager] static policy: RemoveContainer (container id: 6dd59735cabf973b6d8b2a46a14c0711831daca248e918bfcfe2041420931963)
Jun 23 22:08:34 node1 kubelet[15541]: E0623 22:08:34.332058   15541 pod_workers.go:191] Error syncing pod 93ff1a9840f77f8b2b924a85815e17fe ("kube-apiserver-node1_kube-system(93ff1a9840f77f8b2b924a85815e17fe)"), skipping: failed to "StartContainer" for "kube-apiserver" with CrashLoopBackOff: "back-off 5m0s restarting failed container=kube-apiserver pod=kube-apiserver-node1_kube-system(93ff1a9840f77f8b2b924a85815e17fe)"
Jun 23 22:08:34 node1 kubelet[15541]: E0623 22:08:34.427587   15541 kubelet.go:2267] node "node1" not found
Jun 23 22:08:34 node1 kubelet[15541]: E0623 22:08:34.506152   15541 reflector.go:123] k8s.io/kubernetes/pkg/kubelet/kubelet.go:450: Failed to list *v1.Service: Get https://172.16.16.111:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 172.16.16.111:6443: connect: connection refused
Jun 23 22:08:34 node1 kubelet[15541]: E0623 22:08:34.527813   15541 kubelet.go:2267] node "node1" not found

I'm using Ubuntu 18.04, and there are 7 compute nodes in my cluster. All drained (accidentally, kind of)! I've installed my K8s cluster using Kubespray.

Is there any way to uncordon any of these nodes, so that the necessary k8s pods can be scheduled?
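
For reference, draining/cordoning only sets spec.unschedulable on each Node object, and uncordoning is a single kubectl call per node; the catch is that it needs a working kube-apiserver, which is exactly what is down here. A minimal sketch (node names are placeholders):

kubectl uncordon node2
kubectl get nodes -o name | xargs -n1 kubectl uncordon   # all nodes at once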

Any help would be greatly appreciated.

Update

I asked a separate question about how to connect to etcd here: Can't connect to the ETCD of Kubernetes

Recommended answer

If you have production or 'live' workloads, the safest approach is to provision a new cluster and switch the workloads over gradually.

Kubernetes keeps its state in etcd, so you could potentially connect to etcd and clear the 'drained' state, but you will probably have to look at the source code to see where that happens and which specific keys/values are stored in etcd.

The logs you shared basically show that the kube-apiserver cannot start, so it's likely that it is trying to connect to etcd on startup and etcd is telling it: "you cannot start on this node because it has been drained".
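
Since the kubelet log shows kube-apiserver in CrashLoopBackOff, the container's own logs on node1 usually say why it keeps dying. A minimal sketch, assuming the Docker runtime that Kubespray commonly configures (the container ID will be whatever shows up on your host):

docker ps -a | grep kube-apiserver
docker logs <apiserver-container-id>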

The typical startup sequence for the masters is something like this:


  • etcd
  • kube-apiserver
  • kube-controller-manager
  • kube-scheduler
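
These components run as static pods created by the kubelet, so one way to check that they are still defined on the master is to list the static pod manifests. On a kubeadm/Kubespray-style install they usually live under /etc/kubernetes/manifests, though the exact path, and whether etcd shows up there, depends on how the cluster was set up:

ls /etc/kubernetes/manifests/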

You can also follow any guide on connecting to etcd and see if you can troubleshoot further, for example this one. Then you could examine/delete some of the node keys at your own risk:

/registry/minions/node-x1
/registry/minions/node-x2
/registry/minions/node-x3
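
A hedged sketch of what inspecting (and, at your own risk, deleting) those keys could look like with etcdctl v3; the endpoint and certificate paths are assumptions based on common Kubespray defaults and will differ per cluster:

ETCDCTL_API=3 etcdctl \
  --endpoints=https://172.16.16.111:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/admin-node1.pem \
  --key=/etc/ssl/etcd/ssl/admin-node1-key.pem \
  get /registry/minions --prefix --keys-only

# the same flags with "del /registry/minions/node-x1" would remove that node object outright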
