Alertmanager, different interval for different alert rules
Problem description
I'm using Alertmanager to get alerts for Prometheus metrics, and I have different alert rules for different metrics. Is it possible to set a different interval for each alert rule? For example, for metric1 I have rule1, which I need to check on a daily basis, and for metric2 I have rule2, which should be checked every 2 hours.
The for: 5m property ensures that the rule expression returns true continuously for the given duration before the alert is triggered. For example, if there is a spike in CPU usage lasting 30 seconds, the alert will not be triggered, because the for property is set to 5 minutes. Hence this is not the right property for you.
I believe that you can use the repeat_interval of Alertmanager to set the time interval at which notifications are sent. The alert itself still exists as before, but the notification for it is sent again according to your repeat_interval. This link explains the related settings in detail.
group_wait sets how long to initially wait to send a notification for a particular group of alerts.
group_interval dictates how long to wait before sending notifications about new alerts that are added to a group of alerts that have been alerted on before.
repeat_interval is used to determine the wait time before a firing alert that has already been successfully sent to the receiver is sent again.
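As a rough illustration of how the three timers sit together on a route, here is a minimal sketch. The receiver name default-receiver is hypothetical, and the values shown are the defaults Alertmanager documents for these settings:

```yaml
route:
  receiver: 'default-receiver'   # hypothetical receiver name
  group_by: ['alertname']
  group_wait: 30s       # hold the first notification for a new group for 30s
  group_interval: 5m    # wait 5m before notifying about alerts newly added to the group
  repeat_interval: 4h   # re-send a still-firing, already-notified group after 4h
receivers:
- name: 'default-receiver'
```

Any of the three values can be overridden on a child route, which is what the rest of this answer relies on.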
In order to put them to work, you have to define labels for each alert. For example, in my alerts.yml file I create the labels app_type: server and app_type: service:
    groups:
    - name: monitor_cpu
      rules:
      - alert: job:node_cpu_usage:percentage_gt_50
        expr: 100 * node_cpu_seconds_total{mode="user"} / ignoring(mode) group_left sum(node_cpu_seconds_total) without(mode) > 5.5
        for: 1m
        labels:
          severity: critical
          app_type: server
        annotations:
          summary: "High CPU usage"
          description: "Server {{ $labels.instance }} has high CPU usage."
    - name: targets
      rules:
      - alert: monitor_service_down
        expr: up == 0
        for: 1m
        labels:
          severity: critical
          app_type: service
        annotations:
          summary: "Monitor service non-operational"
          description: "Service {{ $labels.instance }} is down."
Then I create a route tree to send notifications to different groups by matching the specific label. And here comes the solution that I use: I define a different group_wait, group_interval, and repeat_interval for each group. You can then use repeat_interval: 1h and repeat_interval: 24h in different route leaves:
    global:
      smtp_from: 'mail@gmail.com'
      smtp_smarthost: smtp.gmail.com:587
      smtp_auth_username: 'mail@gmail.com'
      smtp_auth_identity: 'mail@gmail.com'
      smtp_auth_password: ''
    route:
      receiver: 'admin-team'
      routes:
      # Parent route: catches every alert carrying one of the two app_type labels.
      - match_re:
          app_type: (server|service)
        receiver: 'admin-team'
        routes:
        # Leaf for server alerts: notifications repeat every hour.
        - match:
            app_type: server
          receiver: 'admin-team'
          group_wait: 1m
          group_interval: 5m
          repeat_interval: 1h
        # Leaf for service alerts: notifications repeat once a day.
        - match:
            app_type: service
          receiver: 'dev-team'
          group_wait: 1m
          group_interval: 5m
          repeat_interval: 24h
    receivers:
    - name: 'admin-team'
      email_configs:
      - to: 'admin-mail@gmail.com'
    - name: 'dev-team'
      email_configs:
      - to: 'dev-mail@gmail.com'
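Recent Alertmanager releases deprecate match and match_re in favour of the matchers field. Assuming such a release, the same route tree could be sketched as follows. This variant is an adaptation and was not part of the test described in this answer:

```yaml
route:
  receiver: 'admin-team'
  routes:
  - matchers:
    - app_type =~ "server|service"   # regex matcher, replaces match_re
    receiver: 'admin-team'
    routes:
    - matchers:
      - app_type = "server"          # equality matcher, replaces match
      receiver: 'admin-team'
      group_wait: 1m
      group_interval: 5m
      repeat_interval: 1h
    - matchers:
      - app_type = "service"
      receiver: 'dev-team'
      group_wait: 1m
      group_interval: 5m
      repeat_interval: 24h
```

Either form of the configuration can be validated before reloading with amtool check-config, and the rule file with promtool check rules.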
Unfortunately, I did not test with a 24-hour interval, only with shorter intervals of a few minutes, and that worked. I expect it to work for longer intervals as well.
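Note that repeat_interval only controls how often notifications are re-sent. It does not change how often Prometheus evaluates the rule. If the evaluation frequency itself has to differ per rule, Prometheus rule groups accept an optional interval field that overrides the global evaluation_interval for that group. A minimal sketch, untested here, with hypothetical group names and placeholder expressions standing in for the real rules:

```yaml
groups:
- name: daily_checks            # hypothetical group name
  interval: 24h                 # evaluate the rules in this group once a day
  rules:
  - alert: rule1
    expr: metric1 > 0           # placeholder expression
    labels:
      app_type: server
- name: two_hourly_checks       # hypothetical group name
  interval: 2h                  # evaluate the rules in this group every 2 hours
  rules:
  - alert: rule2
    expr: metric2 > 0           # placeholder expression
    labels:
      app_type: service
```

Because interval is set per group, rules that need different evaluation frequencies have to live in different groups.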