DropWizard Metrics Meters vs Timers


Question

I am learning the DropWizard Metrics library (formerly Coda Hale metrics) and I am confused as to when I should be using Meters vs Timers. According to the docs:

Meter: A meter measures the rate at which a set of events occur

and:

Timer: A timer is basically a histogram of the duration of a type of event and a meter of the rate of its occurrence

Based on these definitions, I can't discern the difference between these. What's confusing me is that Timer is not used the way I would have expected it to be used. To me, Timer is just that: a timer; it should measure the time difference between a start() and stop(). But it appears that Timers also capture rates at which events occur, which feels like they are stepping on Meters' toes.

If I could see an example of what each component outputs that might help me understand when/where to use either of these.

Solution

You're confused in part because a DW Metrics Timer IS, among other things, a DW Metrics Meter.

A Meter is exclusively concerned with rates, measured in Hz (events per second). Each Meter results in 4(?) distinct metrics being published:

  • a mean (average) rate since the Meter was created
  • 1, 5 and 15 minute rolling mean rates

You use a Meter by calling mark() at the point in your code where an event happens -- DW Metrics adds each call to a running count (mark() counts one event, mark(n) counts n) and uses the elapsed wall time to calculate the rate at which that count is increasing:

Meter getRequests = registry.meter("some-operation.operations"); // created on first use; the mean rate is measured from creation
int numberOfOps = doSomeNumberOfOperations(); // takes 10 seconds, returns 333
getRequests.mark(numberOfOps); // adds 333 to the event count

We would expect the mean rate to be about 33.3 Hz, as 333 operations were marked in the 10 seconds since the Meter was created.
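The 33.3 Hz figure is just the event count divided by the elapsed time. A minimal sketch of that arithmetic follows; SimpleMeter is a hypothetical stand-in written for this example, not the library's Meter, and it takes timestamps as arguments so the result is deterministic.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a meter: counts events and derives a mean rate. */
final class SimpleMeter {
    private final long startNanos; // moment the meter was created
    private long count;            // total events marked so far

    SimpleMeter(long startNanos) {
        this.startNanos = startNanos;
    }

    /** Records n events; the count only ever grows. */
    void mark(long n) {
        count += n;
    }

    long getCount() {
        return count;
    }

    /** Mean rate in events per second between creation and nowNanos. */
    double meanRate(long nowNanos) {
        long elapsedNanos = nowNanos - startNanos;
        if (elapsedNanos <= 0) {
            return 0.0;
        }
        double elapsedSeconds = elapsedNanos / (double) TimeUnit.SECONDS.toNanos(1);
        return count / elapsedSeconds;
    }

    public static void main(String[] args) {
        SimpleMeter meter = new SimpleMeter(0L);
        meter.mark(333); // 333 operations finished...
        long tenSecondsLater = TimeUnit.SECONDS.toNanos(10); // ...10 seconds after creation
        System.out.println("count = " + meter.getCount());
        System.out.println("mean rate (Hz) = " + meter.meanRate(tenSecondsLater));
    }
}
```

Marking 333 events and asking for the rate 10 seconds after creation yields a mean rate of about 33.3 events per second.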

A Timer calculates the same 4 metrics (considering each Timer.Context to be one event) and adds to them a number of additional metrics:

  • a count of the number of events
  • min, mean and max durations seen since start of Metrics
  • standard deviation
  • a "histogram," recording the distribution of durations at the 50th, 75th, 95th, 98th, 99th and 99.9th percentiles

There are something like 15 total metrics reported for each Timer.

In short: Timers report a LOT of metrics, and they can be tricky to understand, but once you do they're a quite powerful way to spot spiky behavior.

Fact is, just collecting the time spent between two points isn't a terribly useful metric. Consider: you have a block of code like this:

Timer timer = registry.timer("costly-operation.service-time");
Timer.Context context = timer.time(); // starts the clock
costlyOperation(); // service time 10 ms
context.stop(); // records the elapsed time as one event

Let's assume that costlyOperation() has a constant cost, constant load and operates on a single thread. Inside a 1 minute reporting period, we should expect to time this operation 6000 times. Obviously, we will not be reporting the actual service time over the wire 6000x -- instead we need some way to summarize all those operations to fit our desired reporting window. DW Metrics' Timer does this for us, automatically, once a minute (our reporting period). After 5 minutes, our metrics registry would be reporting:

  • a mean rate of 100 (events per second)
  • a 1 minute mean rate of 100
  • a 5 minute mean rate of 100
  • a count of 30000 (total events seen)
  • a max of 10 (ms)
  • a min of 10
  • a mean of 10
  • a 50th percentile (p50) value of 10
  • a 99.9th percentile (p999) value of 10

Now, let's suppose we enter a period where our operation occasionally goes completely off the rails and blocks for an extended period:

Timer timer = registry.timer("costly-operation.service-time");
Timer.Context context = timer.time(); // starts the clock
costlyOperation(); // takes 10 ms usually, but once every 1000 times spikes to 1000 ms
context.stop(); // records the elapsed time as one event

Over a 1 minute collection period, we would now see fewer than 6000 executions, as every 1000th execution takes 1000 ms instead of 10 ms. That works out to about 5505. After the first minute of this (6 minutes total system time) we would now see:

  • a mean rate of 98 (events per second)
  • a 1 minute mean rate of 91.75
  • a 5 minute mean rate of 98.35
  • a count of 35505 (total events seen)
  • a max duration of 1000 (ms)
  • a min duration of 10
  • a mean duration of 10.13
  • a 50th percentile (p50) value of 10
  • a 99.9th percentile (p999) value of 1000

If you graphed this, you'd see that most requests (the p50, p75, p99 etc) were completing in 10 ms, but one request out of 1000 (p999) was completing in 1s. This would also be seen as a slight reduction in the average rate (about 2%) and a sizable reduction in the 1 minute mean (nearly 9%).
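The figures listed for the spiky period can be reproduced with a few lines of arithmetic. The sketch below is an illustration, not Metrics code: it assumes one thread calling the operation back to back and uses plain averages over each window, whereas the library's 1 and 5 minute rates are exponentially weighted moving averages, so real output would differ somewhat. The class and method names are made up for this example.

```java
/** Back-of-the-envelope check of the example's numbers; not part of the Metrics library. */
final class SpikeArithmetic {
    static final long NORMAL_MS = 10;          // usual service time
    static final long SPIKE_MS = 1000;         // service time of every 1000th call
    static final long MINUTE_MS = 60_000;
    static final long STEADY_CALLS = 5 * 6000; // five steady minutes at 100 calls/s

    /** Calls that finish within one minute when every 1000th call spikes. */
    static long callsInSpikyMinute() {
        long elapsedMs = 0;
        long calls = 0;
        while (true) {
            long cost = ((calls + 1) % 1000 == 0) ? SPIKE_MS : NORMAL_MS;
            if (elapsedMs + cost > MINUTE_MS) {
                return calls;
            }
            elapsedMs += cost;
            calls++;
        }
    }

    static long totalCalls() {
        return STEADY_CALLS + callsInSpikyMinute();
    }

    /** Plain average rate over the last minute, in calls per second. */
    static double oneMinuteRate() {
        return callsInSpikyMinute() / 60.0;
    }

    /** Plain average over the last five minutes: four steady, one spiky. */
    static double fiveMinuteRate() {
        return (4 * 6000 + callsInSpikyMinute()) / 300.0;
    }

    /** Average rate over all six minutes. */
    static double meanRate() {
        return totalCalls() / 360.0;
    }

    /** Mean call duration in ms; the single thread was busy for all six minutes. */
    static double meanDurationMs() {
        return 6.0 * MINUTE_MS / totalCalls();
    }

    public static void main(String[] args) {
        System.out.println("calls in spiky minute = " + callsInSpikyMinute());
        System.out.println("total count           = " + totalCalls());
        System.out.println("1 minute rate         = " + oneMinuteRate());
        System.out.println("5 minute rate         = " + fiveMinuteRate());
        System.out.println("mean rate             = " + meanRate());
        System.out.println("mean duration (ms)    = " + meanDurationMs());
    }
}
```

Under those assumptions the spiky minute holds 5505 calls, giving a 1 minute rate of 91.75, a 5 minute rate of 98.35, an overall rate of about 98.6 and a mean duration of about 10.14 ms.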

If you only look at the means over time (either rate or duration), you'll never spot these spikes -- they get dragged into the background noise when averaged with a lot of successful operations. Similarly, just knowing the max isn't helpful, because it doesn't tell you how frequently the max occurs. This is why histograms are a powerful tool for tracking performance, and why DW Metrics' Timer publishes both a rate AND a histogram.
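To make the mean-versus-percentile point concrete, here is a small sketch over a hypothetical sample of 1000 durations in which every 50th call takes 1000 ms instead of 10 ms. The sample, the class name and the nearest-rank percentile method are all assumptions chosen for illustration; this is not how the Metrics library builds its snapshots.

```java
import java.util.Arrays;

/** Compares a mean with nearest-rank percentiles on a made-up sample. */
final class PercentileSketch {

    /**
     * Nearest-rank percentile of a sorted, non-empty array.
     * permille is the percentile in tenths of a percent: 500 for p50, 990 for p99.
     */
    static long percentile(long[] sorted, int permille) {
        // 1-based rank = ceil(permille / 1000 * n), computed in integer arithmetic
        int rank = (int) (((long) permille * sorted.length + 999) / 1000);
        return sorted[Math.max(rank, 1) - 1];
    }

    static double mean(long[] values) {
        return Arrays.stream(values).average().orElse(0.0);
    }

    /** Hypothetical sample: 1000 calls, every 50th takes 1000 ms instead of 10 ms. */
    static long[] sampleDurations() {
        long[] durations = new long[1000];
        for (int i = 0; i < durations.length; i++) {
            durations[i] = ((i + 1) % 50 == 0) ? 1000 : 10;
        }
        return durations;
    }

    public static void main(String[] args) {
        long[] sorted = sampleDurations();
        Arrays.sort(sorted);
        System.out.println("mean (ms) = " + mean(sorted));
        System.out.println("p50  (ms) = " + percentile(sorted, 500));
        System.out.println("p95  (ms) = " + percentile(sorted, 950));
        System.out.println("p99  (ms) = " + percentile(sorted, 990));
        System.out.println("max  (ms) = " + sorted[sorted.length - 1]);
    }
}
```

For this sample the mean comes out at 29.8 ms, a duration no single call actually had, while p50 and p95 stay at 10 ms and p99 jumps to 1000 ms. The percentiles show both that most calls are fast and roughly how often the slow ones occur.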
