How to detect a new metric with a Prometheus alerting rule


Question

Say I have a metric request_failures for users. For each user I add a unique label value to the metric. So for user u1, after two failed requests, I get the following sample:

    request_failures{user_name="u1"} 2

I also have a rule that fires when there are new failures. Its expression is:

    increase(request_failures[1m]) > 0

This works well for a user that has already encountered failures. For example, when u1 encounters a third failure, the rule fires.

When a request fails for a new user u2, the metric looks like this:

    request_failures{user_name="u1"} 2
    request_failures{user_name="u2"} 1

Now the problem is that the alert rule doesn't fire for u2. It seems that the rule cannot recognize a "new metric", although both series belong to the same metric, request_failures, and differ only in their labels.

Can anyone point out how I should construct the rule?

Recommended answer

The reason the rule doesn't fire is that the increase() function doesn't treat a newly created counter as having been 0 before its first scrape. I didn't find any source for that, but it seems to be the case.
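To make that behaviour concrete, here is a minimal Python sketch of a simplified increase(). The name simple_increase, the 15-second scrape interval and the sample timestamps are assumptions made for illustration; this is not how Prometheus implements the function, which also extrapolates to the window boundaries and handles counter resets.

```python
from typing import Optional

Sample = tuple[float, float]  # (timestamp in seconds, counter value)


def simple_increase(samples: list[Sample], now: float, window: float) -> Optional[float]:
    """Toy stand-in for increase(): last minus first sample inside the window.

    Returns None when fewer than two samples fall inside the window.
    Extrapolation and counter resets are deliberately ignored.
    """
    in_window = [value for ts, value in samples if now - window <= ts <= now]
    if len(in_window) < 2:
        return None
    return in_window[-1] - in_window[0]


# u1 already existed: scraped every 15 s (assumed), third failure seen at t=45.
u1 = [(0, 2), (15, 2), (30, 2), (45, 3)]
# u2 is brand new: its very first sample already carries the value 1.
u2 = [(45, 1)]

print(simple_increase(u1, now=45, window=60))  # 1
print(simple_increase(u2, now=45, window=60))  # None
```

With a single sample in the window there is nothing to subtract from, so the toy function yields no value for u2, and a comparison such as `> 0` has nothing to match.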

Therefore you want to detect two cases:

  • the user has failures and had none before
  • the user has had new failures in the last N minutes

This can be rephrased using the opposite logic:

An alert should be triggered for a user with errors, unless there was no increase in errors for this user in the last N minutes.

Which readily translates into the following PromQL:

    request_failures > 0 unless increase(request_failures[1m]) == 0
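The effect of this expression can be traced with a small Python sketch that models each side as a mapping from user_name to a value. The function evaluate_rule and the dictionaries are hypothetical stand-ins for PromQL instant vectors, and a user missing from the increase mapping stands for a series with too few samples to produce a result.

```python
def evaluate_rule(current: dict[str, float], increase: dict[str, float]) -> dict[str, float]:
    """Toy model of: request_failures > 0 unless increase(...) == 0.

    `current` maps user_name to the latest counter value. `increase` maps
    user_name to the increase over the window and omits users for which
    no increase could be computed.
    """
    left = {user: value for user, value in current.items() if value > 0}
    right = {user for user, inc in increase.items() if inc == 0}
    # `unless` keeps left-hand elements that have no match on the right.
    return {user: value for user, value in left.items() if user not in right}


# u1 had no new failures; u2 is new, so it has no increase value at all.
print(evaluate_rule({"u1": 2, "u2": 1}, {"u1": 0}))  # {'u2': 1}

# Later: u1 fails a third time; u2 stays at 1 and now has samples to compare.
print(evaluate_rule({"u1": 3, "u2": 1}, {"u1": 1, "u2": 0}))  # {'u1': 3}
```

In this model a new user triggers the alert only while no increase can be computed for its series; once an increase of 0 is available, the `unless` clause suppresses it.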


In hindsight, it makes sense that increase() cannot assume the previous value was 0: the function is evaluated over a range, and the previous sample may lie outside that range and be non-zero. It therefore needs at least two samples inside the range to produce a value.
