Alerting on a missing metric for many hosts in Alertmanager


Problem description

I have many servers that are monitored with Prometheus, and every host exposes the same metrics.

I need an alert rule that fires when a specific metric (such as some_metrics) has been missing on a specific host for 5 minutes (5m).

I looked at absent and absent_over_time, but these functions do not return the labels of the missing metric, such as ip or hostname.

I should also mention that I don't want to create a separate rule for each host.

I have searched for a solution but have not found one.

Is there a way to do this?

Recommended answer

To get the labels, you need a metric that carries all the labels you want. A good choice is usually up, which also distinguishes between a missing metric and an unreachable target.

The rule alerts when up is 1 for an instance of the job, and the unless binary operator suppresses the alert when the metric is present on that same instance:

- alert: MissingMetricInFooTarget
  expr: up{job="foo"} == 1 unless on(instance) some_metrics{job="foo"}
  for: 5m
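
For context, here is a minimal sketch of how such a rule might sit in a complete Prometheus rule file. The group name, the severity label, and the annotation text are assumptions added for illustration and are not part of the original answer; the for: 5m clause reflects the 5-minute requirement stated in the question. Because the result of the expression keeps the labels of up, the instance label is available to the annotation template.

```yaml
groups:
  - name: missing-metrics          # hypothetical group name
    rules:
      - alert: MissingMetricInFooTarget
        # Targets that are up, minus those that expose some_metrics,
        # matched per instance. What remains are reachable targets
        # on which the metric is missing.
        expr: up{job="foo"} == 1 unless on(instance) some_metrics{job="foo"}
        # Fire only after the condition has held for 5 minutes.
        for: 5m
        labels:
          severity: warning        # assumed label, adjust to your routing
        annotations:
          summary: "some_metrics is missing on {{ $labels.instance }}"
```

A file like this can be validated with promtool check rules before it is loaded. A single rule covers every instance of the job, so no per-host rule is needed.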
