Grafana - Single stat after big counter reset

Question

We use Grafana + Prometheus to monitor our infrastructure. We recently added some business-focused metrics, and I've been having issues with one of the counters we track: a session time counter. Each time a session ends, we increase that counter by the time the user spent in that session, so if a user spends 2m using the software, the counter is incremented by 120000 ms. For a few days that approach worked perfectly fine. Yesterday, however, one instance's counter was far larger than the rest, and that big counter was reset when part of the service was restarted. Since then I can't get a meaningful single stat panel.

Here's a graph of what happened (this counter has 3 labels that result in >50 label combinations):

(Image: Prometheus graph)

The current all-time total tracked by this counter is 13.8 years over a 4-day period, but since the counter reset, my single stat metrics for a 24h period have been either -20 years (using diff) or 35 years (using range). This is not wrong if you don't account for the counter reset, since diff and range look at min/max/first/current values, but it's not a useful metric anymore.
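
The effect described above can be reproduced with a small sketch. It assumes that diff means "current minus first" and range means "max minus min", which is how the question describes them; the sample values and function names are made up for illustration.

```python
# Hypothetical samples of one counter (ms of session time). The drop from
# 300 to 10 stands in for the restart that reset the counter.
samples = [100, 200, 300, 10, 50]


def naive_diff(values):
    """Current (last) value minus first value, ignoring resets."""
    return values[-1] - values[0]


def naive_range(values):
    """Largest value minus smallest value, ignoring resets."""
    return max(values) - min(values)


print(naive_diff(samples))   # -50
print(naive_range(samples))  # 290
```

In this toy series the counter grew by about 250 in total (200 before the reset, roughly 50 after it), so diff comes out negative and range overshoots, which matches the shape of the -20 years and 35 years readings.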

If I set the timeframe to not include the counter reset, both Diff and Range show very close values to what is expected (our usage is very linear and predictable).

The singlestat panel formula looks like this:

sum(dyno_app_music_total_user_listen_time{server=~"[[server]]", clusterId=~"[[clusterid]]"})

How can I handle resets in a counter for a singlestat metric?

Recommended answer

I'm not sure I fully understand your question, but as I understand it, you have a metric with 3 labels (resulting in 50 different timeseries) and you want to display a singlestat panel that sums all those counters together across all of time.

The way you handle counter resets in Prometheus is by using rate() or, if you want an absolute value, increase(). So the way you would write your query (assuming you wanted the sum of counter increases for all time) is:

sum(increase(dyno_app_music_total_user_listen_time{...}[100y]))
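
To see why this survives a restart, here is a minimal sketch of the reset handling idea: any drop in a counter is treated as a restart from zero, and only the growth is summed. It is an illustration under that assumption, not Prometheus's actual implementation (the real increase() also extrapolates to the edges of the window, so its results can differ slightly). The function name, series names, and sample values are hypothetical.

```python
def reset_aware_increase(values):
    """Sum the growth of a counter, treating any drop as a reset to zero."""
    total = 0
    for previous, current in zip(values, values[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The counter went backwards: assume it restarted from 0,
            # so everything it shows now is new growth.
            total += current
    return total


# Two hypothetical series of the same metric; only the first one was reset.
series = {
    "instance-a": [100, 200, 300, 10, 50],
    "instance-b": [5, 15, 25, 35, 45],
}

# Equivalent in spirit to sum(increase(...)): per-series increase, then sum.
total = sum(reset_aware_increase(values) for values in series.values())
print(total)  # 290
```

Applying the reset handling per series before summing is the important part; summing the raw counters first, as in the original panel query, mixes the reset into the total where it can no longer be detected.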

Do note, however, that this is going to get slower and slower over time, because Prometheus will have to go back and load your 50 timeseries for all time before doing the calculation, to the point where the number of samples loaded will exceed either the limit configured in Prometheus or the amount of memory available.

What may be more useful than that (and would over time get rid of the spike you experienced "yesterday") is to instead show a graph of the rate of change of your counters over some much shorter time range:

sum(rate(dyno_app_music_total_user_listen_time{...}[1h]))

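This would show you, for any time range you may choose to display on your graph, the average per-second rate at which session time was added over the previous hour. Since the counter is in milliseconds, dividing that rate by 1000 gives (an approximation of) the average number of sessions in progress.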
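
One way to read the number this query returns, assuming the counter is in milliseconds as the question describes: the rate is milliseconds of session time added per second of wall-clock time. The sketch below is plain arithmetic with made-up figures and hypothetical function names, and it ignores the extrapolation Prometheus applies.

```python
def average_rate(increase_ms, window_seconds):
    """Average per-second growth of the counter over the window."""
    return increase_ms / window_seconds


def approx_concurrent_sessions(rate_ms_per_second):
    """Convert ms of session time per second into an average session count."""
    return rate_ms_per_second / 1000


one_hour = 3600  # seconds in the [1h] window

# Hypothetical load: 50 users each in a session for the whole hour.
increase_ms = 50 * one_hour * 1000

rate = average_rate(increase_ms, one_hour)
print(rate)                              # 50000.0
print(approx_concurrent_sessions(rate))  # 50.0
```

Because the counter is only incremented when a session ends, the real graph will be lumpier than this steady-state arithmetic suggests, which is one reason a wide window such as 1h helps.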
