Alert on missing series/data

Question

I'm trying to understand how I can get Grafana to alert me when a metric is no longer being scraped.

The metric I'm using for this example is mongodb_instance_uptime_seconds. When the instance goes down, the metric is no longer generated, so the series goes missing in Prometheus. At the moment the alert condition is: when last() of query(A, 1m, now) < 600. The goal was to alert when the uptime is below 5 minutes, meaning I want to alert on both restarts and stops. But Grafana won't alert when an instance goes down, because there is no last() value to evaluate, and once the instance has been down for more than 5 minutes the series isn't reported at all.

Any clues on how to move forward?

Recommended answer

The metric typically used to determine whether an instance is being scraped successfully is up. Prometheus generates it automatically for every scrape target, so if you want an alert for any scrape endpoint that is down, just use the query up == 0, which returns every endpoint whose last scrape was not successful. If you want to alert only for this specific endpoint, add label matchers, for example up{instance="mongodb.foo.com",job="mongo"} == 0.
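
As a rough illustration of what the up == 0 query returns, the sketch below builds a request URL for Prometheus's instant-query HTTP endpoint (/api/v1/query) and pulls the down targets out of a response. The server address http://localhost:9090, the helper names, and the sample payload are assumptions made up for this example, not values from the question.

```python
from __future__ import annotations

from urllib.parse import urlencode


def build_query_url(base_url: str, expr: str) -> str:
    """Build the URL for a Prometheus instant query (GET /api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


def down_targets(response: dict) -> list[tuple[str, str]]:
    """Return sorted (job, instance) pairs from an `up == 0` query response."""
    if response.get("status") != "success":
        raise ValueError("query failed")
    return sorted(
        (sample["metric"].get("job", ""), sample["metric"].get("instance", ""))
        for sample in response["data"]["result"]
    )


# Hypothetical response body, shaped like an instant-vector result.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"instance": "mongodb.foo.com", "job": "mongo"},
                "value": [1700000000.0, "0"],
            }
        ],
    },
}

if __name__ == "__main__":
    # Assumed local Prometheus address; adjust to your own server.
    print(build_query_url("http://localhost:9090", "up == 0"))
    print(down_targets(SAMPLE_RESPONSE))
```

An empty result list means every target was scraped successfully on its last attempt.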

If you're ever interested in using Prometheus alerting rules with Alertmanager instead of Grafana for this, the rule would look like:

groups:
- name: General
  rules:
  - alert: Endpoint_Down
    expr: up == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Exporter is down: {{ $labels.instance }}"
      description: "The endpoint {{ $labels.instance }} is not able to be scraped by Prometheus."
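
The for: 5m clause means the alert only fires once the expression has stayed true across consecutive rule evaluations for at least five minutes; until then it is pending. The sketch below mimics that behaviour in Python under simplifying assumptions: evaluations are given as (timestamp in seconds, expression-is-true) pairs, and the function name alert_states is made up for the illustration. It is not Prometheus's actual implementation.

```python
from __future__ import annotations


def alert_states(evaluations: list[tuple[int, bool]], for_seconds: int = 300) -> list[str]:
    """Classify each evaluation as 'inactive', 'pending', or 'firing'."""
    states = []
    active_since = None  # timestamp at which the expression first became true
    for timestamp, is_true in evaluations:
        if not is_true:
            active_since = None  # a single false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        elapsed = timestamp - active_since
        states.append("firing" if elapsed >= for_seconds else "pending")
    return states


if __name__ == "__main__":
    # Hypothetical evaluations every 60 seconds: the target is down from t=60 onward.
    samples = [(0, False)] + [(t, True) for t in range(60, 481, 60)]
    for (t, _), state in zip(samples, alert_states(samples)):
        print(t, state)
```

With these made-up samples the alert stays pending until the expression has been true for 300 seconds, which is why a short scrape hiccup does not page anyone.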
