Alertmanager, different interval for different alert rules


Problem Description


I'm using alertmanager to get alerts for prometheus metrics. I have different alert rules for different metrics. Is it possible to set a different interval for each alert rule? For example, for metric1 I have rule1 and I need to check this rule on a daily interval, and for metric2 I have rule2, which should be checked every 2 hours.

Solution

The for: 5m property is used to ensure that the rule returns true for X continuous minutes before the alert is triggered. For example, if there is a spike in CPU usage for 30 seconds, the alert will not be triggered because we set the for property to 5 minutes. Hence this is not the right property for you.
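For illustration, a minimal rule sketch using the for property could look like this (the alert name and expression are placeholders, not taken from the question):

groups:
- name: example
  rules:
  - alert: HighCPUUsage      # hypothetical alert name
    expr: node_load1 > 10    # hypothetical expression
    for: 5m                  # must stay true for 5 continuous minutes before the alert fires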

I believe that you can use the repeat_interval of the Alertmanager to set the time interval for sending notifications. The alert is still there, but you fire/trigger the notification depending on your repeat_interval. This link explains them in detail.

  • group_wait sets how long to initially wait to send a notification for a particular group of alerts.
  • group_interval dictates how long to wait before sending notifications about new alerts that are added to a group of alerts that have been alerted on before.
  • repeat_interval is used to determine the wait time before a firing alert that has already been successfully sent to the receiver is sent again.
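To see how the three settings relate, a minimal route sketch could look like the following (the receiver name and durations are placeholders; the full working configuration follows below):

route:
  receiver: 'default-team'   # hypothetical receiver
  group_wait: 30s            # initial wait before the first notification for a new group
  group_interval: 5m         # wait before notifying about new alerts added to an existing group
  repeat_interval: 4h        # wait before re-sending a notification that was already delivered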

In order to put them to work, you have to define labels for each alert. For example, in my alerts.yml file I create the labels app_type: server and app_type: service:

groups:
- name: monitor_cpu
  rules:
  - alert: job:node_cpu_usage:percentage_gt_50
    expr: 100 * node_cpu_seconds_total{mode="user"} / ignoring(mode) group_left sum(node_cpu_seconds_total) without(mode) > 5.5
    for: 1m
    labels:
      severity: critical
      app_type: server
    annotations:
      summary: "High CPU usage"
      description: "Server {{ $labels.instance }} has high CPU usage."
- name: targets
  rules:
  - alert: monitor_service_down
    expr: up == 0
    for: 1m
    labels:
      severity: critical
      app_type: service
    annotations:
      summary: "Monitor service non-operational"
      description: "Service {{ $labels.instance }} is down."

Then I create a route tree to send notifications to different groups by matching the specific label. And here comes the solution that I use: I define different group_wait, group_interval, and repeat_interval values for each group. Then you can use repeat_interval: 1h and repeat_interval: 24h in different leaf routes:

global:
  smtp_from: 'mail@gmail.com'
  smtp_smarthost: smtp.gmail.com:587
  smtp_auth_username: 'mail@gmail.com'
  smtp_auth_identity: 'mail@gmail.com'
  smtp_auth_password: ''

route:
  receiver: 'admin-team'
  routes:
    - match_re:
        app_type: (server|service)
      receiver: 'admin-team'
      routes:
      - match:
          app_type: server
        receiver: 'admin-team'
        group_wait: 1m
        group_interval: 5m
        repeat_interval: 1h
      - match:
          app_type: service
        receiver: 'dev-team'
        group_wait: 1m
        group_interval: 5m
        repeat_interval: 24h

receivers:
 - name: 'admin-team'
   email_configs:
   - to: 'admin-mail@gmail.com'

 - name: 'dev-team'
   email_configs:
   - to: 'dev-mail@gmail.com'

Unfortunately, I did not test it with a 24-hour interval, only with gaps of a few minutes, and it worked. I think it will work for longer intervals as well.
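If you want to verify the routing without waiting for hours, one option is to temporarily shorten the durations in a leaf route, for example (test values, not the ones from the answer):

      - match:
          app_type: service
        receiver: 'dev-team'
        group_wait: 10s
        group_interval: 1m
        repeat_interval: 2m

If amtool is installed, amtool check-config alertmanager.yml should also catch syntax errors in the routing configuration before you reload Alertmanager.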
