Node status changes to Unknown on a high resource requirement pod

Problem description

I have a Jenkins deployment pipeline that involves the kubernetes plugin. Using the kubernetes plugin, I create a slave pod for building a Node application with yarn. Requests and limits for CPU and memory are set.
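A rough sketch of the kind of pod spec I mean, written here as a standalone manifest just for illustration (the image name, pod name, and resource values are placeholders, not my actual pipeline configuration):

    $ cat <<'EOF' | kubectl apply -f -
    # Hypothetical build pod for a yarn build; all names and values are examples only.
    apiVersion: v1
    kind: Pod
    metadata:
      name: jenkins-slave-yarn-example
    spec:
      containers:
      - name: yarn-builder
        image: node:12            # placeholder build image
        command: ["cat"]          # keep the container alive, as in the usual Jenkins agent pattern
        tty: true
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "1"
            memory: "2Gi"
    EOF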

When the Jenkins master schedules the slave, sometimes (I haven't seen a pattern, as of now) the pod makes the entire node unreachable and changes the status of the node to Unknown. On careful inspection in Grafana, the CPU and memory usage seem to be well within range, with no visible spike. The only spike that occurs is in disk I/O, which peaks at ~4 MiB.

I am not sure if that is the reason the node is unable to address itself as a cluster member. I need help with a few things here:

a) How to diagnose in depth the reasons for the node leaving the cluster.

b) If the reason is disk IOPS, are there any default requests or limits for IOPS at the Kubernetes level?

PS: I am using EBS (gp2).

Recommended answer

As per the docs, for the node to be 'Ready':

True if the node is healthy and ready to accept pods, False if the node is not healthy and is not accepting pods, and Unknown if the node controller has not heard from the node in the last node-monitor-grace-period (default is 40 seconds)

It would seem that when you run your workloads, your kube-apiserver doesn't hear from your node (kubelet) for 40 seconds. There could be multiple reasons; some things that you can try:

  • To see the 'Events' on your node, run:

$ kubectl describe node <node-name>
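    The Conditions block and the Events at the bottom of that output are usually the interesting parts. Two extra commands that may help narrow things down (the node name is a placeholder):

        # Show only the node conditions (Ready, MemoryPressure, DiskPressure, PIDPressure)
        $ kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
        # Events that reference the node itself (NodeNotReady and similar)
        $ kubectl get events -A --field-selector involvedObject.name=<node-name>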

  • To see whether there is anything unusual on your kube-apiserver. On your active master, run:

    $ docker logs <container-id-of-kube-apiserver>
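    If you don't know the container id, something like this should find it (this assumes the control plane components run as Docker containers, e.g. a kubeadm setup with the Docker runtime):

        # Find the kube-apiserver container id first, then tail its recent logs.
        $ docker ps | grep kube-apiserver
        $ docker logs --tail 200 <container-id-of-kube-apiserver>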
    

  • To see whether there is anything unusual on your kube-controller-manager when your node goes into the 'Unknown' state. On your active master, run:

    $ docker logs <container-id-of-kube-controller-manager>
    

  • Increase the --node-monitor-grace-period option in your kube-controller-manager. You can add it to the command line in /etc/kubernetes/manifests/kube-controller-manager.yaml and restart the kube-controller-manager container.
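    A rough sketch of that change, assuming a kubeadm-style static pod manifest at the path above (100s is only an example value; the default is 40s):

        # Check whether the flag is already present in the manifest.
        $ grep -n node-monitor-grace-period /etc/kubernetes/manifests/kube-controller-manager.yaml
        # If it is not, add a flag like the following under the container's "command:" list:
        #     - --node-monitor-grace-period=100s
        # The kubelet normally recreates the static pod when the manifest changes;
        # otherwise restart the kube-controller-manager container as noted above.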

    When the node is in the 'Unknown' state, can you ssh into it and see if you can reach the kube-apiserver? Both on the <master-ip>:6443 and the kubernetes.default.svc.cluster.local:443 endpoints.
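    A quick way to test that from the node, assuming curl is available (-k skips certificate verification because only reachability matters here; a 401/403 response still proves the API server is reachable, while a timeout does not):

        # From the affected node, while it is in the 'Unknown' state:
        $ curl -k -m 5 https://<master-ip>:6443/healthz
        $ curl -k -m 5 https://kubernetes.default.svc.cluster.local:443/healthz
        # It is also worth checking that the kubelet itself is alive and what it logged recently.
        $ systemctl status kubelet
        $ journalctl -u kubelet --since "15 min ago" | tail -n 50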
