How to get overall uptime of a server with prometheus and node_exporter


Question


I'm looking for a query to get the average uptime of the server on which prometheus runs over the last week. It should be about 15h/week, so about 8-10 %.


I'm using Prometheus 2.5.0 with node_exporter on CentOS 7.6.1810. My most promising experiments would be:

1. avg_over_time(up{job="prometheus"}[7d])


This is what I've found when looking for ways to get average uptimes, but it gives me exactly 1. (My guess is it ignores the times in which no scrapes happened?)

2. sum_over_time(up{job="prometheus"}[7d]) * 15 / 604800


This technically works, but is dependent on the scrape interval, which is 15s in my case. I can't seem to find a way to get said interval from prometheus' config, so I have to hardcode it into the query.
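To make the constants in that second experiment explicit: 604800 is the number of seconds in 7 days, and 15 is the scrape interval in seconds. A worked version follows, assuming (hypothetically, not measured) that the server was up for exactly 15 hours during the week and that no scrape failed while it was up, so that sum_over_time returns N = 15 * 3600 / 15 = 3600:

```latex
\[
\text{uptime fraction}
  = \frac{N \cdot \Delta t}{T}
  = \frac{3600 \times 15\,\mathrm{s}}{604800\,\mathrm{s}}
  \approx 0.089
\]
```

Here N is the number of samples with value 1, \Delta t is the scrape interval and T is the length of the window. Under those assumptions the result is about 8.9 %, which sits inside the expected 8-10 % range.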


I've also tried to find ways to get all start and end times of a job, but to no avail thus far.

Recommended answer

Here you go. Don't ask. (o:

avg_over_time(
  (
    sum without() (up{job="prometheus"})
      or
    (0 * sum_over_time(up{job="prometheus"}[7d]))
  )[7d:5m]
)

A bit of explanation:

  1. sum without() (up{job="prometheus"}): take the up metric (the sum without() part is there to get rid of the metric name while keeping all other labels);
  2. 0 * sum_over_time(up{job="prometheus"}[7d]): produces a zero-valued vector for each of the up{job="prometheus"} label combinations seen over the past week (e.g. in case you have multiple Prometheus instances);
  3. or: combines the two, so you get the actual value where available and zero where missing;
  4. [7d:5m]: PromQL subquery, produces a range vector spanning 7 days, with 5 minute resolution based on the expression preceding it;
  5. avg_over_time: takes an average over time of the up metric with zeroes filled in as defaults, where missing.
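
As a rough sanity check on the steps above, the 5 minute subquery resolution takes over the role of the hardcoded scrape interval. The figures below assume (hypothetically) 15 hours of uptime in the week, and they ignore both the short lookback window after each shutdown and alignment effects at the edges of the range:

```latex
\[
\text{points in } [7d{:}5m] \approx 7 \times 24 \times 12 = 2016,
\qquad
\text{points with } up = 1 \approx 15 \times 12 = 180
\]
\[
\texttt{avg\_over\_time} \approx \frac{180 \times 1 + 1836 \times 0}{2016} \approx 0.089
\]
```

The zero-filled points are what pull the average below 1. Without them, only the 180 points that actually have data would be averaged.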


You may also want to tack on an and sum_over_time(up{job="prometheus"}[7d]) to the end of that expression, to only get a result for label combinations that existed at some point over the previous 7 days. Otherwise, because of the combination of the 7 day range and the 7 day subquery, you'll get results for all combinations seen over the previous 14 days.
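
Put together, the query with that filter appended might look like the sketch below. The placement of the and clause after the closing parenthesis of avg_over_time is my reading of the suggestion, not something the answer spells out. Subqueries were, as far as I know, introduced in Prometheus 2.7, so the 2.5.0 mentioned in the question would need an upgrade for this to parse.

```promql
# Sketch only: the answer's query with the suggested `and` filter appended.
# Assumes a Prometheus version with subquery support.
avg_over_time(
  (
    sum without() (up{job="prometheus"})
      or
    (0 * sum_over_time(up{job="prometheus"}[7d]))
  )[7d:5m]
)
and
# Keep only label combinations that had at least one sample in the last 7 days.
sum_over_time(up{job="prometheus"}[7d])
```

The and operator keeps the values from the left-hand side and uses the right-hand side only to decide which label sets survive, so the result is still a fraction between 0 and 1. Multiply by 100 if a percentage is wanted.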


It is not an efficient query by any stretch of the imagination, but it does not require you to hardcode your scrape interval into the query. As requested. (o:
