How to detect a new metric with a Prometheus alerting rule
Problem description
Say I have a metric request_failures for users. For each user I add a unique label value to the metric. So for user u1, after two failed requests, I get the following:
request_failures{user_name="u1"} 2
I also have a rule that fires when there are new failures. Its expression is:
increase(request_failures[1m]) > 0
This works well for a user who has already encountered failures. For example, when u1 encounters a third failure, the rule fires.
When a request fails for a new user u2, the metrics become:
request_failures{user_name="u1"} 2
request_failures{user_name="u2"} 1
Now the problem is that the alert rule doesn't fire for u2. It seems that the rule cannot recognize a "new metric", even though all of these series are the same metric, request_failures, just with different labels.
Can anyone point out how I should construct the rule?
Recommended answer
The rule doesn't fire because the increase() function does not treat a newly created counter as having been 0 before its first scrape. I didn't find a source for this, but it seems to be the case.
Therefore you want to detect two cases:
- a user encounters a problem and had no problems before
- a user encounters a new problem within the last N minutes
This can be rephrased using the opposite logic:
an alert should be triggered for a user with errors, unless there was no increase in errors for this user in the last N minutes
This readily translates into the following PromQL:
rule: request_failures > 0 UNLESS increase(request_failures[1m]) == 0
In hindsight, this behaviour of the increase() function makes sense: it cannot assume the previous value was 0, because it is evaluated over a range. The previous value may lie outside the range and need not be 0. So it is reasonable that it needs at least two points in the range to produce a value.
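To make the behaviour concrete, here is a minimal Python sketch of both rules on toy data. It is not how Prometheus is implemented: simple_increase is a hypothetical stand-in that only subtracts the first in-window sample from the last one, and ignores extrapolation, counter resets, and staleness handling. The 15-second scrape interval and the sample values are assumptions chosen for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

Sample = Tuple[float, float]      # (timestamp in seconds, counter value)
Series = Dict[str, List[Sample]]  # user_name -> samples, oldest first


def simple_increase(samples: List[Sample], now: float, window: float) -> Optional[float]:
    """Toy stand-in for increase(): last minus first sample inside the window.

    Returns None (no result) when fewer than two samples fall in the window.
    Unlike Prometheus, this ignores extrapolation and counter resets.
    """
    in_window = [v for t, v in samples if now - window <= t <= now]
    if len(in_window) < 2:
        return None
    return in_window[-1] - in_window[0]


def original_rule(series: Series, now: float, window: float = 60.0) -> Set[str]:
    """Models: increase(request_failures[1m]) > 0"""
    firing = set()
    for user, samples in series.items():
        inc = simple_increase(samples, now, window)
        if inc is not None and inc > 0:
            firing.add(user)
    return firing


def revised_rule(series: Series, now: float, window: float = 60.0) -> Set[str]:
    """Models: request_failures > 0 UNLESS increase(request_failures[1m]) == 0"""
    firing = set()
    for user, samples in series.items():
        current = [v for t, v in samples if t <= now]
        if not current or current[-1] <= 0:
            continue  # left-hand side has no element for this user
        inc = simple_increase(samples, now, window)
        if inc is not None and inc == 0:
            continue  # UNLESS drops users whose increase is exactly 0
        firing.add(user)
    return firing


# Assumed data: u1 has been at 2 failures for a minute and fails again at t=75;
# u2 appears for the first time at t=60 with 1 failure.
series: Series = {
    "u1": [(0, 2), (15, 2), (30, 2), (45, 2), (60, 2), (75, 3)],
    "u2": [(60, 1), (75, 1)],
}

print(original_rule(series, now=60))  # set(): u2 has only one sample in the window
print(revised_rule(series, now=60))   # {'u2'}: no increase result, so UNLESS keeps u2
print(original_rule(series, now=75))  # {'u1'}
print(revised_rule(series, now=75))   # {'u1'}: u2 now has increase == 0 and is dropped
```

Note that in this model the revised rule keeps firing for a user until a second sample lands inside the window, so the range should comfortably exceed the scrape interval.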