PromQL query to find CPU and memory used for the last week


Problem description


    I'm trying to write a Prometheus query that can tell me how much, as a percentage, CPU (and another for memory and network) each namespace has used over a time frame, say a week.

    The metrics I'm trying to use are container_spec_cpu_shares and container_memory_working_set_bytes but I can't figure out how sum them over time. Whatever I try either returns 0 or errors.

    Any help on how to write a query for this would be greatly appreciated.

    Solution

    To check the percentage of memory used by each namespace you will need a query similar to the one below:

    sum( container_memory_working_set_bytes{container="", namespace=~".+"} ) by (namespace)
    / ignoring (namespace) group_left
    sum( machine_memory_bytes{}) * 100
    

    The above query should produce a graph similar to this one:

    Disclaimers:

    • The screenshot above is from Grafana for better visibility.
    • This query does not acknowledge changes in available RAM (changes in nodes, autoscaling of nodes, etc.).

    To get the metric over a period of time in PromQL you will need to use an additional function, such as:

    • avg_over_time(EXP[time]).

    To go back in time and calculate resources from a specific point in time you will need to use:

    • offset TIME

    Using the above pointers, the query should combine into:

    avg_over_time( sum(container_memory_working_set_bytes{container="", namespace=~".+"} offset 45m) by (namespace)[120m:])  / ignoring (namespace) group_left 
    sum( machine_memory_bytes{}) 
    

    The above query will calculate the average percentage of memory used by each namespace, divided by all memory in the cluster, over a span of 120 minutes. The offset also shifts that window 45 minutes back from the present time.

    Example:

    • Time of running the query: 20:00
    • avg_over_time(EXPR[2h:])
    • offset 45 min

    The above example will average over the window from 17:15 to 19:15. You can modify it to include the whole week :).
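The window arithmetic above can be sketched in a few lines of Python (`subquery_window` is a made-up helper for illustration, not part of any Prometheus tooling):

```python
from datetime import datetime, timedelta

# Hypothetical helper: given the evaluation time, the subquery range, and the
# offset, return the window that avg_over_time(EXPR[range:]) with an offset
# actually averages over.
def subquery_window(eval_time, range_, offset):
    end = eval_time - offset   # the offset shifts the whole window back
    start = end - range_       # the subquery range ends at `end`
    return start, end

start, end = subquery_window(datetime(2020, 8, 1, 20, 0),  # query run at 20:00
                             timedelta(hours=2),           # [2h:]
                             timedelta(minutes=45))        # offset 45m
print(start.strftime("%H:%M"), end.strftime("%H:%M"))  # 17:15 19:15
```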

    If you want to calculate the CPU usage by namespace you can replace these metrics with the ones below:

    • container_cpu_usage_seconds_total{} - please check rate() function when using this metric (counter)
    • machine_cpu_cores{}
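Putting those two metrics into the same shape as the memory query gives a sketch like the one below. This is not from the original answer, and the 5m `rate()` window is an assumption you may want to tune:

```
avg_over_time( sum( rate(container_cpu_usage_seconds_total{container="", namespace=~".+"}[5m]) ) by (namespace)[120m:])
/ ignoring (namespace) group_left
sum( machine_cpu_cores{}) * 100
```

The `rate()` is needed because `container_cpu_usage_seconds_total` is a counter; dividing cores-used by `machine_cpu_cores` yields a cluster-wide percentage analogous to the memory query.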

    You could also look at these network metrics:

    • container_network_receive_bytes_total - please check rate() function when using this metric (counter)
    • container_network_transmit_bytes_total - please check rate() function when using this metric (counter)
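For completeness, a minimal per-namespace bandwidth sketch (again an assumption, not part of the original answer) could look like:

```
sum( rate(container_network_receive_bytes_total{namespace=~".+"}[5m]) ) by (namespace)
+ sum( rate(container_network_transmit_bytes_total{namespace=~".+"}[5m]) ) by (namespace)
```

Both series aggregate to the same `namespace` label, so the two sums match one-to-one and add up to total bytes/second in and out per namespace.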

    I've included more explanation below with examples (memory), the testing methodology, and a dissection of the queries used.


    Let's assume:

    • Kubernetes cluster 1.18.6 (Kubespray) with 12GB of memory in total:
      • master node with 2GB of memory
      • worker-one node with 8GB of memory
      • worker-two node with 2GB of memory
    • Prometheus and Grafana installed with: Github.com: Coreos: Kube-prometheus
    • Namespace kruk with single ubuntu pod set to generate artificial load with below command:
      • $ stress-ng --vm 1 --vm-bytes <AMOUNT_OF_RAM_USED> --vm-method all -t 60m -v

    The artificial load was generated with stress-ng two times:

    • 60 minutes - 1GB of memory used
    • 60 minutes - 2GB of memory used

    The percentage of memory used by namespace kruk in this timespan:

    • 1GB which accounts for about ~8.5% of all memory in the cluster (12GB)
    • 2GB which accounts for about ~17.5% of all memory in the cluster (12GB)

    The load from the Prometheus query for the kruk namespace looked like this:

    Calculation using avg_over_time(EXPR[time:]) / memory in the cluster showed usage of about 13% ((17.5+8.5)/2) when querying the time the artificial load was generated. This should indicate that the query was correct:
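The 13% figure can be checked with quick arithmetic. The 8.5% and 17.5% readings are the measured values quoted above; the raw stress-ng allocations alone would give slightly lower ratios:

```python
total_gb = 12.0

# Raw ratios of the stress-ng allocations to total cluster memory:
low = 1.0 / total_gb * 100    # ~8.3%; measured ~8.5% (working set adds overhead)
high = 2.0 / total_gb * 100   # ~16.7%; measured ~17.5%

# Average over the two equal 60-minute load periods, which the 120m window covers:
avg_measured = (17.5 + 8.5) / 2
print(round(low, 1), round(high, 1), avg_measured)  # 8.3 16.7 13.0
```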


    As for the used query:

    avg_over_time( sum( container_memory_working_set_bytes{container="", namespace="kruk"} offset 1380m )
    by (namespace)[120m:]) / ignoring (namespace) group_left 
    sum( machine_memory_bytes{}) * 100 
    

    The above query is very similar to the one at the beginning, but I've made some changes to show only the kruk namespace.

    I divided the query explanation into 2 parts (dividend/divisor).

    Dividend

    container_memory_working_set_bytes{container="", namespace="kruk"}
    

    This metric will output records of memory usage in the namespace kruk. If you were to query for all namespaces, see the additional explanation:

    • namespace=~".+" <- this regexp will match only when the value inside the namespace key contains 1 or more characters. This is to avoid empty namespace results with aggregated metrics.
    • container="" <- this part is used to filter the metrics. If you were to query without it, you would get multiple memory usage metrics for each container/pod, like the ones below. container="" will match only when the container value is empty (last row in the citation below).

    container_memory_working_set_bytes{container="POD",endpoint="https-metrics",id="/kubepods/podab1ed1fb-dc8c-47db-acc8-4a01e3f9ea1b/e249c12010a27f82389ebfff3c7c133f2a5da19799d2f5bb794bcdb5dc5f8bca",image="k8s.gcr.io/pause:3.2",instance="192.168.0.124:10250",job="kubelet",metrics_path="/metrics/cadvisor",name="k8s_POD_ubuntu_kruk_ab1ed1fb-dc8c-47db-acc8-4a01e3f9ea1b_0",namespace="kruk",node="worker-one",pod="ubuntu",service="kubelet"} 692224
    container_memory_working_set_bytes{container="ubuntu",endpoint="https-metrics",id="/kubepods/podab1ed1fb-dc8c-47db-acc8-4a01e3f9ea1b/fae287e7043ff00da16b6e6a8688bfba0bfe30634c52e7563fcf18ac5850f6d9",image="ubuntu@sha256:5d1d5407f353843ecf8b16524bc5565aa332e9e6a1297c73a92d3e754b8a636d",instance="192.168.0.124:10250",job="kubelet",metrics_path="/metrics/cadvisor",name="k8s_ubuntu_ubuntu_kruk_ab1ed1fb-dc8c-47db-acc8-4a01e3f9ea1b_0",namespace="kruk",node="worker-one",pod="ubuntu",service="kubelet"} 2186403840
    container_memory_working_set_bytes{endpoint="https-metrics",id="/kubepods/podab1ed1fb-dc8c-47db-acc8-4a01e3f9ea1b",instance="192.168.0.124:10250",job="kubelet",metrics_path="/metrics/cadvisor",namespace="kruk",node="worker-one",pod="ubuntu",service="kubelet"} 2187096064
    

    You can read more about the pause container here:

    sum( container_memory_working_set_bytes{container="", namespace="kruk"} offset 1380m )
    by (namespace)
    

    This query will sum the results by their respective namespaces. offset 1380m is used to go back in time as the tests were made in the past.

    avg_over_time( sum( container_memory_working_set_bytes{container="", namespace="kruk"} offset 1380m )
    by (namespace)[120m:])
    

    This query will calculate the average of the memory metric across namespaces over the specified 120m window, shifted 1380m before the present time.

    You can read more about avg_over_time() here:

    Divisor

    sum( machine_memory_bytes{})
    

    This metric will sum the memory available in each node in the cluster.

    EXPR / ignoring (namespace) group_left 
    sum( machine_memory_bytes{}) * 100 
    

    Focusing on:

    • / ignoring (namespace) group_left <- this expression will allow you to divide each "record" in the dividend (each namespace with their memory average across time) by a divisor (all memory in the cluster). You can read more about it here: Prometheus.io: Vector matching
    • * 100 is rather self-explanatory and will multiply the result by 100 to look more like a percentage.

