DropWizard Metrics Meters vs Timers

Question

I am learning the DropWizard Metrics library (formerly Coda Hale metrics) and I am confused as to when I should be using Meters vs Timers. According to the docs:

Meter: A meter measures the rate at which a set of events occur

And:

Timer: A timer is basically a histogram of the duration of a type of event and a meter of the rate of its occurrence

Based on these definitions, I can't discern the difference between these. What's confusing me is that Timer is not used the way I would have expected it to be used. To me, Timer is just that: a timer; it should measure the time diff between a start() and stop(). But it appears that Timers also capture rates at which events occur, which feels like they are stepping on Meters' toes.

If I could see an example of what each component outputs, that might help me understand when/where to use either of these.

Recommended answer

You're confused in part because a DW Metrics Timer IS, among other things, a DW Metrics Meter.

A Meter is exclusively concerned with rates, measured in Hz (events per second). Each Meter results in 4 distinct rate metrics being published:

  • A mean (average) rate since the metric was started
  • 1, 5 and 15 minute rolling mean rates (exponentially weighted moving averages)

You use a Meter by marking events at different points in your code -- each call to mark() adds to a running event count, and DW Metrics combines that count with the clock to calculate the rate at which it is increasing:

Meter getRequests = registry.meter("some-operation.operations");
getRequests.mark(); // records one event (adds 1 to the count); it does not reset anything
int numberOfOps = doSomeNumberOfOperations(); // takes 10 seconds, returns 333
getRequests.mark(numberOfOps); // records numberOfOps more events (adds 333 to the count)

We would expect the mean rate to be roughly 33 Hz, as about 333 events were marked over the 10 seconds between the two calls to mark().
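
The arithmetic behind that mean rate can be sketched without the library. The class below is a hypothetical stand-in (MeanRateSketch is not part of DW Metrics), assuming the mean rate is simply the accumulated event count divided by the time the meter has existed; the elapsed time is passed in by hand rather than read from a clock.

```java
// Hypothetical stand-in for a meter's mean rate; not the library's implementation.
class MeanRateSketch {
    private long count = 0;

    // Records a single event.
    void mark() {
        mark(1);
    }

    // Records n events by adding them to the running count.
    void mark(long n) {
        count += n;
    }

    long getCount() {
        return count;
    }

    // Mean rate in events per second, given how long the meter has existed.
    double meanRate(double elapsedSeconds) {
        return elapsedSeconds <= 0 ? 0.0 : count / elapsedSeconds;
    }

    public static void main(String[] args) {
        MeanRateSketch requests = new MeanRateSketch();
        requests.mark();     // one event
        requests.mark(333);  // 333 more events, reported after a 10-second batch
        System.out.println("count = " + requests.getCount());
        System.out.println("mean rate (Hz) = " + requests.meanRate(10.0));
    }
}
```

With 334 events over 10 seconds this comes out at roughly 33 Hz. The 1, 5 and 15 minute rates are moving averages and are not modeled here.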

A Timer calculates the above 4 rate metrics (treating each completed Timer.Context as one event) and adds a number of additional metrics:

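  • A count of the number of events
  • The min, mean, and max duration seen since the metric was started
  • The standard deviation
  • A "histogram", recording the duration distributed at the 50th, 75th, 95th, 98th, 99th, and 99.9th percentiles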

There are something like 15 total metrics reported for each Timer.

In short: Timers report a LOT of metrics, and they can be tricky to understand, but once you do, they're quite a powerful way to spot spiky behavior.

Fact is, just collecting the time spent between two points isn't a terribly useful metric. Consider: you have a block of code like this:

Timer timer = registry.timer("costly-operation.service-time");
Timer.Context context = timer.time(); // starts timing
costlyOperation(); // service time 10 ms
context.stop(); // records the elapsed time as one event

Let's assume that costlyOperation() has a constant cost, constant load, and operates on a single thread. Inside a 1 minute reporting period, we should expect to time this operation 6000 times. Obviously, we will not be reporting the actual service time over the wire 6000x -- instead, we need some way to summarize all those operations to fit our desired reporting window. DW Metrics' Timer does this for us, automatically, once a minute (our reporting period). After 5 minutes, our metrics registry would be reporting:

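  • a rate of 100 (events per second)
  • a 1-minute mean rate of 100
  • a 5-minute mean rate of 100
  • a count of 30000 (total events seen)
  • a max of 10 (ms)
  • a min of 10
  • a mean of 10
  • a 50th percentile (p50) value of 10
  • a 99.9th percentile (p999) value of 10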
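
These steady-state figures can be reproduced with a small stand-in that uses only the JDK. This is a sketch under the assumptions stated above (constant 10 ms service time, one thread, five minutes of back-to-back calls); SteadyStateSummarySketch is a hypothetical name, and java.util.LongSummaryStatistics stands in for the count/min/max/mean part of a timer.

```java
import java.util.LongSummaryStatistics;

class SteadyStateSummarySketch {
    // Accumulates `count` identical durations, as a stand-in for what a timer collects.
    static LongSummaryStatistics summarize(int count, long durationMs) {
        LongSummaryStatistics stats = new LongSummaryStatistics();
        for (int i = 0; i < count; i++) {
            stats.accept(durationMs);
        }
        return stats;
    }

    // Events per second over the whole observation window.
    static double meanRate(long count, double elapsedSeconds) {
        return count / elapsedSeconds;
    }

    public static void main(String[] args) {
        int perMinute = 60_000 / 10; // back-to-back 10 ms calls that fit in one minute
        LongSummaryStatistics stats = summarize(5 * perMinute, 10);
        System.out.println("count = " + stats.getCount());
        System.out.println("min = " + stats.getMin() + " ms, max = " + stats.getMax() + " ms");
        System.out.println("mean = " + stats.getAverage() + " ms");
        System.out.println("mean rate (Hz) = " + meanRate(stats.getCount(), 5 * 60));
    }
}
```

Because every sample is identical, every percentile is also 10 ms, which is why the p50 and p999 values match the mean in this case.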

Now, let's consider we enter a period where occasionally our operation goes completely off the rails and blocks for an extended period:

Timer timer = registry.timer("costly-operation.service-time");
Timer.Context context = timer.time(); // starts timing
costlyOperation(); // takes 10 ms usually, but once every 1000 times spikes to 1000 ms
context.stop(); // records the elapsed time as one event

Over a 1 minute collection period, we would now see fewer than 6000 executions, as every 1000th execution takes longer. Each block of 1000 executions now takes 999 × 10 ms + 1000 ms = 10.99 s, which works out to about 5460 executions per minute. After the first minute of this (6 minutes total system time) we would now see:

  • a mean rate of about 98.5 (events per second)
  • a 1-minute mean rate of about 91
  • a 5-minute mean rate of about 98.2
  • a count of about 35460 (total events seen)
  • a max duration of 1000 (ms)
  • a min of 10
  • a mean duration of about 10.15
  • a 50th percentile (p50) value of 10
  • a 99.9th percentile (p999) value of 1000
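
A back-of-the-envelope check of the spiky minute, assuming calls run back to back on one thread and exactly one call in every 1000 takes 1000 ms instead of 10 ms. SpikeArithmeticSketch and its method names are hypothetical, and the rates here are plain window averages rather than the moving averages the library reports, so the results only land close to the figures listed above.

```java
class SpikeArithmeticSketch {
    // Average service time when one call in `period` takes spikeMs instead of normalMs.
    static double meanServiceMs(double normalMs, double spikeMs, int period) {
        return ((period - 1) * normalMs + spikeMs) / period;
    }

    // Number of calls a single thread can complete back to back in windowMs.
    static double callsPerWindow(double windowMs, double meanServiceMs) {
        return windowMs / meanServiceMs;
    }

    public static void main(String[] args) {
        double meanMs = meanServiceMs(10, 1000, 1000);        // (999 * 10 + 1000) / 1000
        double spikyMinute = callsPerWindow(60_000, meanMs);  // calls completed in the spiky minute
        double total = 30_000 + spikyMinute;                  // five steady minutes plus the spiky one

        System.out.println("calls in the spiky minute = " + Math.round(spikyMinute));
        System.out.println("rate over that minute (Hz) = " + spikyMinute / 60);
        System.out.println("mean rate over six minutes (Hz) = " + total / 360);
        System.out.println("mean duration over six minutes (ms) = " + 360_000 / total);
    }
}
```

The mean duration moves by well under 2% even though one call in a thousand is a hundred times slower, which is the point the percentiles make visible.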

If you graph this, you'd see that most requests (the p50, p75, p99, etc.) were completing in 10 ms, but one request out of 1000 (p999) was taking 1 s. This would also show up as a slight reduction in the mean rate (about 2%) and a sizable reduction in the 1-minute mean rate (nearly 9%).

If you only look at mean values over time (either rate or duration), you'll never spot these spikes -- they get dragged into the background noise when averaged with a lot of successful operations. Similarly, just knowing the max isn't helpful, because it doesn't tell you how frequently the max occurs. This is why histograms are a powerful tool for tracking performance, and why DW Metrics' Timer publishes both a rate AND a histogram.
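
To make the contrast concrete, here is a minimal sketch that compares the mean with a few percentiles over 1000 samples containing a single 1000 ms spike. SpikeVisibilitySketch is a hypothetical name, and the linear-interpolation quantile is an illustrative choice; a real Timer samples durations into a reservoir, which this sketch does not attempt to model.

```java
import java.util.Arrays;

class SpikeVisibilitySketch {
    // Linear-interpolation quantile over sorted values (an illustrative choice).
    static double quantile(double[] sorted, double q) {
        double pos = q * (sorted.length + 1);
        int index = (int) pos;
        if (index < 1) {
            return sorted[0];
        }
        if (index >= sorted.length) {
            return sorted[sorted.length - 1];
        }
        double lower = sorted[index - 1];
        double upper = sorted[index];
        return lower + (pos - Math.floor(pos)) * (upper - lower);
    }

    static double mean(double[] values) {
        return Arrays.stream(values).average().orElse(0.0);
    }

    // 1000 samples: 999 normal calls at 10 ms and one 1000 ms spike, sorted ascending.
    static double[] spikySample() {
        double[] durations = new double[1000];
        Arrays.fill(durations, 10.0);
        durations[durations.length - 1] = 1000.0;
        Arrays.sort(durations);
        return durations;
    }

    public static void main(String[] args) {
        double[] d = spikySample();
        System.out.println("mean = " + mean(d));
        System.out.println("p50  = " + quantile(d, 0.50));
        System.out.println("p99  = " + quantile(d, 0.99));
        System.out.println("p999 = " + quantile(d, 0.999));
        System.out.println("max  = " + d[d.length - 1]);
    }
}
```

The mean barely moves off 10 ms and p50 and p99 stay at 10 ms, while p999 jumps to nearly the full spike duration. The max shows the spike too, but says nothing about how often it happens.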
