EKS node moves to NodeNotReady state when running a batch job

Question

I am running a batch job in my EKS cluster that trains an ML model, and the training goes on for 8-10 hours. However, it seems that the node on which the job runs is killed and the job is restarted on a new node. I am monitoring the node in Prometheus, and there appears to have been no CPU or OOM issue.

My next bet was to look into the EKS CloudTrail logs, and right when the node is removed I see the events below -

  • kube-controller-manager logs
controller_utils.go:179] Recording status change NodeNotReady event message for node XXX
controller_utils.go:121] Update ready status of pods on node [XXX]
event.go:274] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"XXX", UID:"1bf33ec8-41cd-434a-8579-3ba4b8cdd5f1", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotReady' Node XXX status is now: NodeNotReady
node_lifecycle_controller.go:917] Node XXX is unresponsive as of 2021-06-09 01:00:48.962450508 +0000 UTC m=+5151508.967069635. Adding it to the Taint queue.
node_lifecycle_controller.go:180] deleting node since it is no longer present in cloud provider: XXX

  • kube-scheduler logs
node_tree.go:113] Removed node "XXX" in group "us-east-2:\x00:us-east-2b" from NodeTree
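
For reference, the EC2 side of such a node removal can be cross-checked from CloudTrail with the AWS CLI. A minimal sketch, assuming the instance was terminated around the timestamp in the controller logs above (the event name, time window, and region are assumptions taken from those logs; adjust as needed) -

# Look up EC2 termination events around the time the node disappeared:
aws cloudtrail lookup-events \
  --region us-east-2 \
  --start-time 2021-06-09T00:30:00Z \
  --end-time 2021-06-09T01:30:00Z \
  --lookup-attributes AttributeKey=EventName,AttributeValue=TerminateInstances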
    

I checked the kubelet logs, but they do not contain any message moving the node to NotReady status. I was expecting to at least see this message in the kubelet log - https://github.com/kubernetes/kubernetes/blob/e9de1b0221dd8687aba527e682fafc7c33370c09/pkg/kubelet/kubelet_node_status.go#L682

This makes me wonder whether the kubelet died, the node became unreachable, or the connection from the kube-apiserver to the kubelet on that node was lost.
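
One way to check that theory directly, as a rough sketch (the node name XXX is a placeholder; the journalctl/systemctl lines assume a systemd-managed kubelet, which is the default on EKS AMIs) -

# Last-reported conditions and heartbeat times for the node:
kubectl describe node XXX
kubectl get node XXX -o jsonpath='{.status.conditions}'

# On the instance itself (via SSH or SSM), check whether the kubelet died or restarted:
journalctl -u kubelet --since "2021-06-09 00:30:00" | tail -n 100
systemctl status kubelet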

I have been at this for days trying to debug the issue, but with no success.

Note: the batch job running in Kubernetes does eventually run successfully on restart. Also, this issue is sporadic, i.e. sometimes the restart happens and sometimes it does not and the job finishes in the first run.

Answer

Are you using Spot Instance nodes? That might be one of the reasons the node gets terminated when the Spot/bid price changes. Try dedicated instances.
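
If the nodes come from an EKS managed node group, a quick way to confirm this is the capacity-type label EKS puts on its nodes; a minimal sketch, assuming the default eks.amazonaws.com/capacityType label is present -

# Show which nodes are SPOT vs ON_DEMAND capacity:
kubectl get nodes -L eks.amazonaws.com/capacityType

# To keep the training pods off Spot capacity, add a nodeSelector under
# spec.template.spec in the Job manifest:
#
#   nodeSelector:
#     eks.amazonaws.com/capacityType: ON_DEMAND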
