Does Kubernetes consider the current memory usage when scheduling pods


Question

The Kubernetes documentation at https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/ states:

The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled Containers is less than the capacity of the node.

Does Kubernetes consider the current state of the node when calculating capacity? To highlight what I mean, here is a concrete example:

Assuming I have a node with 10Gi of RAM, running 10 Pods each with 500Mi of resource requests, and no limits. Let's say they are "bursting", and each Pod is actually using 1Gi of RAM. In this case, the node is fully utilized (10 x 1Gi = 10Gi), but the resource requests are only 10 x 500Mi = 5Gi. Would Kubernetes consider scheduling another pod on this node because only 50% of the memory capacity on the node has been requested, or would it use the fact that 100% of the memory is currently being utilized and treat the node as being at full capacity?

Answer

By default, kubernetes uses cgroups to manage and monitor the "allocatable" memory on a node for pods. It is possible to configure the kubelet to rely entirely on the static reservations and pod requests from your deployments, though, so the method depends on your cluster deployment.
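
If you want to check which mode a cluster is in, the kubelet's live configuration can usually be inspected through the API server proxy. A sketch (the configz endpoint is read-only and may be disabled on some clusters; node01 is the example node used below):

$ kubectl get --raw "/api/v1/nodes/node01/proxy/configz" | jq '.kubeletconfig | {cgroupsPerQOS, systemReserved, kubeReserved, evictionHard}'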

In either case, a node itself tracks "memory pressure", which monitors the existing overall memory usage of the node. If a node is under memory pressure then no new pods will be scheduled and existing pods will be evicted.
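
The pressure state is surfaced as a node condition, so you can check it directly (node01 is the example node used in the output below):

$ kubectl get node node01 -o jsonpath='{.status.conditions[?(@.type=="MemoryPressure")].status}'
False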

It's best to set sensible memory requests and limits for all workloads to help the scheduler as much as possible. If a kubernetes deployment does not configure cgroup memory monitoring, setting requests is a requirement for all workloads. If the deployment is using cgroup memory monitoring, setting requests at least gives the scheduler extra detail about whether the pods to be scheduled will fit on a node.
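
As a sketch, requests and limits can be set on an existing workload with kubectl (my-app is a hypothetical deployment name, and the amounts are placeholders):

$ kubectl set resources deployment my-app --requests=memory=256Mi --limits=memory=512Mi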

The Kubernetes Reserve Compute Resources docco has a good overview of how memory is viewed on a node:

      Node Capacity
---------------------------
|     kube-reserved       |
|-------------------------|
|     system-reserved     |
|-------------------------|
|    eviction-threshold   |
|-------------------------|
|                         |
|      allocatable        |
|   (available for pods)  |
|                         |
|                         |
---------------------------

The default scheduler checks that a node isn't under memory pressure, then looks at the allocatable memory available on the node and whether the new pod's requests will fit in it.

The allocatable memory available is total-available-memory - kube-reserved - system-reserved - eviction-threshold - scheduled-pods.

The value for scheduled-pods can be calculated via a dynamic cgroup, or statically from the pods' resource requests.
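
The static view the scheduler works from, i.e. the sum of the requests of pods already on the node, shows up in kubectl describe node under "Allocated resources":

$ kubectl describe node node01 | grep -A 6 "Allocated resources"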

The kubelet --cgroups-per-qos option, which defaults to true, enables cgroup tracking of scheduled pods. The pods kubernetes runs will be placed in cgroups under a kubepods parent cgroup, grouped by QoS class.
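
On a cgroup v1 node using the cgroupfs driver, the hierarchy typically looks like the listing below; paths differ with the systemd driver (e.g. kubepods.slice) and with cgroup v2, so treat this as illustrative:

$ ls /sys/fs/cgroup/memory/kubepods/
besteffort  burstable  cgroup.procs  memory.limit_in_bytes  memory.usage_in_bytes  ...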

If --cgroups-per-qos=false then the allocatable memory is only reduced by the resource requests of the pods scheduled on the node.

The eviction-threshold is the level of free memory at which Kubernetes starts evicting pods. This defaults to 100Mi but can be set via the kubelet command line. This setting is tied to both the allocatable value for a node and the memory-pressure state of a node in the next section.
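
As a sketch, both hard and soft thresholds are set on the kubelet command line (the values here are assumptions, not recommendations):

# evict as soon as free memory drops below 500Mi
--eviction-hard=memory.available<500Mi
# evict if free memory stays below 1Gi for 90 seconds
--eviction-soft=memory.available<1Gi
--eviction-soft-grace-period=memory.available=1m30s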

The kubelet's system-reserved value can be configured as a static value (--system-reserved=) or monitored dynamically via a cgroup (--system-reserved-cgroup=). This is for any system daemons running outside of kubernetes (sshd, systemd, etc). If you configure a cgroup, the processes all need to be placed in that cgroup.

The kubelet's kube-reserved value can likewise be configured as a static value (via --kube-reserved=) or monitored dynamically via a cgroup (--kube-reserved-cgroup=). This is for kubernetes system daemons that run outside of pods, usually the kubelet and a container runtime.
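
A sketch of both reservations as kubelet flags (the amounts and cgroup names are assumptions; enforcing the reservations against those cgroups additionally requires --enforce-node-allocatable):

--system-reserved=cpu=500m,memory=512Mi
--system-reserved-cgroup=/system.slice
--kube-reserved=cpu=250m,memory=512Mi
--kube-reserved-cgroup=/podruntime.slice
--enforce-node-allocatable=pods,system-reserved,kube-reserved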

Capacity is stored in the Node object.

$ kubectl get node node01 -o json | jq '.status.capacity'
{
  "cpu": "2",
  "ephemeral-storage": "61252420Ki",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "4042284Ki",
  "pods": "110"
}

The allocatable value can be found on the Node object; note that existing usage doesn't change this value. Only scheduling pods with resource requests takes away from the allocatable value.

$ kubectl get node node01 -o json | jq '.status.allocatable'
{
  "cpu": "2",
  "ephemeral-storage": "56450230179",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "3939884Ki",
  "pods": "110"
}

Memory Usage and Pressure

A kube node can also have a "memory pressure" event. This check is done outside of the allocatable resource checks above and is more of a system-level catch-all. Memory pressure looks at the current root cgroup memory usage minus the inactive file cache/buffers, similar to the calculation free does to remove the file cache.

A node under memory pressure will not have new pods scheduled on it, and will actively try to evict existing pods until the memory-pressure state is resolved.
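
Alongside the MemoryPressure condition, recent kubelet versions also taint the node so the scheduler steers pods away; the taint is visible on the Node object (output is illustrative):

$ kubectl describe node node01 | grep Taints
Taints:             node.kubernetes.io/memory-pressure:NoSchedule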

You can set the eviction threshold, the amount of memory kubelet will keep available, via the --eviction-hard=[memory.available<500Mi] flag. The memory requests and usage of pods help inform the eviction process.
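
Eviction prefers pods whose usage exceeds their requests, so BestEffort pods (no requests at all) tend to go first. A pod's QoS class is recorded in its status (my-pod is a hypothetical pod name):

$ kubectl get pod my-pod -o jsonpath='{.status.qosClass}'
Burstable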

kubectl top node will give you the existing memory stats for each node (if you have a metrics service running).

$ kubectl top node
NAME                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
node01               141m         7%     865Mi           22%       

If you are not using cgroups-per-qos and have a number of pods without resource requests, or a number of system daemons, the cluster is likely to have problems scheduling on a memory-constrained system: allocatable will be high, but the memory actually available might be really low.

The Kubernetes Out Of Resource Handling docco includes a script which emulates the kubelet's memory monitoring process:

# This script reproduces what the kubelet does
# to calculate memory.available relative to root cgroup.

# current memory usage
memory_capacity_in_kb=$(cat /proc/meminfo | grep MemTotal | awk '{print $2}')
memory_capacity_in_bytes=$((memory_capacity_in_kb * 1024))
memory_usage_in_bytes=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
memory_total_inactive_file=$(cat /sys/fs/cgroup/memory/memory.stat | grep total_inactive_file | awk '{print $2}')

memory_working_set=${memory_usage_in_bytes}
if [ "$memory_working_set" -lt "$memory_total_inactive_file" ];
then
    memory_working_set=0
else
    memory_working_set=$((memory_usage_in_bytes - memory_total_inactive_file))
fi

memory_available_in_bytes=$((memory_capacity_in_bytes - memory_working_set))
memory_available_in_kb=$((memory_available_in_bytes / 1024))
memory_available_in_mb=$((memory_available_in_kb / 1024))

echo "memory.capacity_in_bytes $memory_capacity_in_bytes"
echo "memory.usage_in_bytes $memory_usage_in_bytes"
echo "memory.total_inactive_file $memory_total_inactive_file"
echo "memory.working_set $memory_working_set"
echo "memory.available_in_bytes $memory_available_in_bytes"
echo "memory.available_in_kb $memory_available_in_kb"
echo "memory.available_in_mb $memory_available_in_mb"
