EKS node moves to NodeNotReady state when running a batch job

Question

I am running a batch job in my EKS cluster that trains an ML model, and the training goes on for 8-10 hours. However, it seems that the node on which the job runs is killed and the job is restarted on a new node. I am monitoring the node in Prometheus, and there appears to have been no CPU or OOM issue.

My next bet was to look into the EKS CloudTrail logs, and right when the node is removed I see the events below -

  • kube-controller-manager logs
controller_utils.go:179] Recording status change NodeNotReady event message for node XXX
controller_utils.go:121] Update ready status of pods on node [XXX]
event.go:274] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"XXX", UID:"1bf33ec8-41cd-434a-8579-3ba4b8cdd5f1", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotReady' Node XXX status is now: NodeNotReady
node_lifecycle_controller.go:917] Node XXX is unresponsive as of 2021-06-09 01:00:48.962450508 +0000 UTC m=+5151508.967069635. Adding it to the Taint queue.
node_lifecycle_controller.go:180] deleting node since it is no longer present in cloud provider: XXX

  • kube-scheduler logs
node_tree.go:113] Removed node "XXX" in group "us-east-2:\x00:us-east-2b" from NodeTree
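
For reference, the EC2 side of such a node removal can be cross-checked from CloudTrail with the AWS CLI. A minimal sketch, assuming the instance was terminated around the timestamp in the controller logs above (the event name, time window, and region are assumptions taken from those logs; adjust as needed) -

# Look up EC2 termination events around the time the node disappeared:
aws cloudtrail lookup-events \
  --region us-east-2 \
  --start-time 2021-06-09T00:30:00Z \
  --end-time 2021-06-09T01:30:00Z \
  --lookup-attributes AttributeKey=EventName,AttributeValue=TerminateInstances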
    

I checked the kubelet logs, but they do not contain any message moving the node to NotReady status. I was expecting to at least see this message in the kubelet log - https://github.com/kubernetes/kubernetes/blob/e9de1b0221dd8687aba527e682fafc7c33370c09/pkg/kubelet/kubelet_node_status.go#L682

This makes me wonder whether the kubelet died, the node became unreachable, or the connection from the kube-apiserver to the kubelet on that node was lost.
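
One way to check that theory directly, as a rough sketch (the node name XXX is a placeholder; the journalctl/systemctl lines assume a systemd-managed kubelet, which is the default on EKS AMIs) -

# Last-reported conditions and heartbeat times for the node:
kubectl describe node XXX
kubectl get node XXX -o jsonpath='{.status.conditions}'

# On the instance itself (via SSH or SSM), check whether the kubelet died or restarted:
journalctl -u kubelet --since "2021-06-09 00:30:00" | tail -n 100
systemctl status kubelet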

I have been at this for days trying to debug the issue, but with no success.

Note: the batch job running in Kubernetes does eventually run successfully on restart. Also, this issue is sporadic, i.e. sometimes the restart happens and sometimes it does not and the job finishes in the first run.

Answer

Are you using Spot Instance nodes? That might be one of the reasons the node gets terminated when the Spot/bid price changes. Try dedicated instances.
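
If the nodes come from an EKS managed node group, a quick way to confirm this is the capacity-type label EKS puts on its nodes; a minimal sketch, assuming the default eks.amazonaws.com/capacityType label is present -

# Show which nodes are SPOT vs ON_DEMAND capacity:
kubectl get nodes -L eks.amazonaws.com/capacityType

# To keep the training pods off Spot capacity, add a nodeSelector under
# spec.template.spec in the Job manifest:
#
#   nodeSelector:
#     eks.amazonaws.com/capacityType: ON_DEMAND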
