Idiomatic way of performance evaluation?


Question

I'm evaluating a network + rendering workload for my project.

The program continuously runs a main loop:

while (true) {
   doSomething()
   drawSomething()
   doSomething2()
   sendSomething()
}

The main loop runs more than 60 times per second.

I want to see a performance breakdown of how much time each procedure takes.

My concern is that if I print the elapsed time at every entry and exit of each procedure, it will introduce huge performance overhead.

I'm curious what the idiomatic way of measuring performance is.

Is printing logs good enough?

Solution


Generally: For repeated short things, you can just time the whole repeat loop. (But microbenchmarking is hard; easy to distort results unless you understand the implications of doing that.)
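For instance, a minimal C++ sketch of timing a whole repeat loop with std::chrono; run_iteration, the volatile sink, and the constants are placeholders for your own code, not something from the question:

```cpp
#include <chrono>
#include <cstdio>

volatile unsigned sink;   // volatile sink so the dummy work below isn't optimized away

// Stand-in for the code under test (hypothetical placeholder).
static void run_iteration() {
    unsigned x = 0;
    for (unsigned i = 0; i < 1000; ++i) x += i * i;
    sink = x;
}

int main() {
    const int N = 100000;                       // enough repeats to dwarf timer overhead
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i)
        run_iteration();
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::nano> total = stop - start;
    std::printf("avg per iteration: %.1f ns\n", total.count() / N);
}
```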

Or if you insist on timing each separate iteration, record the results in an array and print later; you don't want to invoke heavy-weight printing code inside your loop.
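Applied to the main loop in the question, that could look roughly like the sketch below: preallocate storage, record a timestamp around the stage you care about, and only print a summary after the loop ends. The loop bound is a placeholder, and the question's own functions are left as comments:

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

using clock_type = std::chrono::steady_clock;

int main() {
    std::vector<double> draw_ns;                // per-frame timings, printed after the run
    draw_ns.reserve(60 * 60 * 10);              // preallocate so push_back doesn't allocate mid-frame

    for (int frame = 0; frame < 600; ++frame) { // stand-in for `while (true)`
        // doSomething();
        auto t0 = clock_type::now();
        // drawSomething();                     // the stage being measured
        auto t1 = clock_type::now();
        // doSomething2(); sendSomething();

        draw_ns.push_back(std::chrono::duration<double, std::nano>(t1 - t0).count());
    }

    double total = 0;
    for (double ns : draw_ns) total += ns;
    std::printf("drawSomething: %.1f us average over %zu frames\n",
                total / draw_ns.size() / 1000.0, draw_ns.size());
}
```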

This question is way too broad to say anything more specific.

Many languages have benchmarking packages that will help you write microbenchmarks of a single function. Use them. e.g. for Java, JMH makes sure the function under test is warmed up and fully optimized by the JIT, and all that jazz, before doing timed runs. And runs it for a specified interval, counting how many iterations it completes.
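For C++, Google Benchmark (mentioned again below) plays a similar role. A minimal sketch of registering one function with it, where my_function is a hypothetical stand-in:

```cpp
#include <benchmark/benchmark.h>

// Hypothetical function under test.
static int my_function(int x) { return x * 3 + 1; }

static void BM_MyFunction(benchmark::State& state) {
    int x = 42;
    for (auto _ : state) {                        // the framework decides how many iterations to run
        benchmark::DoNotOptimize(my_function(x)); // keep the result live so the call isn't optimized away
    }
}
BENCHMARK(BM_MyFunction);

BENCHMARK_MAIN();                                 // provides main()
```

Build it against the benchmark library (on Linux typically -lbenchmark -lpthread); the framework handles warm-up, picks the iteration count, and reports time per iteration.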

Beware common microbenchmark pitfalls:

  • Failure to warm up code / data caches and stuff: page faults within the timed region for touching new memory, or code / data cache misses, that wouldn't be part of normal operation. (Example of noticing this effect: Performance: memset; example of a wrong conclusion based on this mistake)
  • Failure to give the CPU time to ramp up to max turbo: modern CPUs clock down to idle speeds to save power, only clocking up after a few milliseconds. (Or longer depending on the OS / HW).

    related: on modern x86, RDTSC counts reference cycles, not core clock cycles, so it's subject to the same CPU-frequency variation effects as wall-clock time.

  • On modern CPUs with out-of-order execution, some things are too short to truly time meaningfully, see also this. Performance of a tiny block of assembly language (e.g. generated by a compiler for one function) can't be characterized by a single number, even if it doesn't branch or access memory (so no chance of mispredict or cache miss). It has latency from inputs to outputs, but its throughput when run repeatedly with independent inputs is higher and is a separate number. e.g. an add instruction on a Skylake CPU has 4/clock throughput, but 1 cycle latency, so dummy = foo(x) can be 4x faster than x = foo(x); in a loop. Floating-point instructions have higher latency than integer, so it's often a bigger deal (a short sketch below illustrates this). Memory access is also pipelined on most CPUs, so looping over an array (address for next load easy to calculate) is often much faster than walking a linked list (address for next load isn't available until the previous load completes).

    Obviously performance can differ between CPUs; in the big picture usually it's rare for version A to be faster on Intel, version B to be faster on AMD, but that can easily happen in the small scale. When reporting / recording benchmark numbers, always note what CPU you tested on.

  • Related to the above and below points: you can't benchmark the * operator in C, for example. Some use-cases for it will compile very differently from others, e.g. tmp = foo * i; in a loop can often turn into tmp += foo (strength reduction), or if the multiplier is a constant power of 2 the compiler will just use a shift. The same operator in the source can compile to very different instructions, depending on surrounding code.
  • You need to compile with optimization enabled, but you also need to stop the compiler from optimizing away the work, or hoisting it out of a loop. Make sure you use the result (e.g. print it or store it to a volatile) so the compiler has to produce it. Use a random number or something instead of a compile-time constant for an input so your compiler can't do constant-propagation for things that won't be constants in your real use-case. In C you can sometimes use inline asm or volatile for this, e.g. the stuff this question is asking about. A good benchmarking package like Google Benchmark will include functions for this.
  • If the real use-case for a function lets it inline into callers where some inputs are constant, or the operations can be optimized into other work, it's not very useful to benchmark it on its own.
  • Big complicated functions with special handling for lots of special cases can look fast in a microbenchmark when you run them repeatedly, especially with the same input every time. In real life use-cases, branch prediction often won't be primed for that function with that input. Also, a massively unrolled loop can look good in a microbenchmark, but in real life it slows everything else down with its big instruction-cache footprint leading to eviction of other code.

Related to that last point: Don't tune only for huge inputs, if the real use-case for a function includes a lot of small inputs. e.g. a memcpy implementation that's great for huge inputs but takes too long to figure out which strategy to use for small inputs might not be good. It's a tradeoff; make sure it's good enough for large inputs, but also keep overhead low for small inputs.
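To make the latency-vs-throughput point above concrete, here is a rough C++ sketch (the names and sizes are mine, not from the answer): summing a small array with one accumulator forms a single serial chain of FP adds and is latency-bound, while four independent accumulators let the adds overlap and become throughput-bound, so the second version typically runs several times faster despite doing the same number of additions. Compile with optimization but without -ffast-math, which would let the compiler reassociate the single-accumulator loop itself.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

static volatile double sink;   // forces the compiler to actually compute each checksum

// One accumulator: every add waits for the previous add to finish (latency-bound).
static double sum_one_acc(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}

// Four independent accumulators: the adds can overlap in the pipeline (throughput-bound).
static double sum_four_acc(const std::vector<double>& v) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        s0 += v[i]; s1 += v[i + 1]; s2 += v[i + 2]; s3 += v[i + 3];
    }
    for (; i < v.size(); ++i) s0 += v[i];
    return (s0 + s1) + (s2 + s3);
}

template <class F>
static double run_ms(F sum, std::vector<double>& v, int repeats) {
    double total = 0.0;
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        v[r % v.size()] = r;     // perturb the input so the sum can't be hoisted out of the loop
        total += sum(v);
    }
    auto t1 = std::chrono::steady_clock::now();
    sink = total;                // use the result so the work isn't optimized away
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    std::vector<double> v(4096, 1.0);   // small enough to stay in cache: we want to see ALU effects
    const int repeats = 50000;
    run_ms(sum_one_acc, v, repeats);    // throw-away warm-up pass
    std::printf("one accumulator:   %.1f ms\n", run_ms(sum_one_acc, v, repeats));
    std::printf("four accumulators: %.1f ms\n", run_ms(sum_four_acc, v, repeats));
}
```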

Litmus tests:

  • If you're benchmarking two functions in one program: if reversing the order of testing changes the results, your benchmark isn't fair. e.g. function A might only look slow because you're testing it first, with insufficient warm-up. example: Why is std::vector slower than an array? (it's not, whichever loop runs first has to pay for all the page faults and cache misses; the 2nd just zooms through filling the same memory.)

  • Increasing the iteration count of a repeat loop should linearly increase the total time, and not affect the calculated time-per-call. If not, then you have non-negligible measurement overhead or your code optimized away (e.g. hoisted out of the loop and runs only once instead of N times).

i.e. vary the test parameters as a sanity check.
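One cheap sanity check along those lines, as a sketch reusing the simple std::chrono repeat-loop pattern from above (the work function is again a made-up placeholder): time the same code at N and 2N iterations and confirm the per-call cost barely moves.

```cpp
#include <chrono>
#include <cstdio>

volatile unsigned sink;

// Hypothetical workload being measured.
static void work() {
    unsigned x = 0;
    for (unsigned i = 0; i < 1000; ++i) x += i * i;
    sink = x;
}

static double ns_per_call(long iters) {
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i) work();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}

int main() {
    // If the benchmark is sound, doubling the iteration count should roughly double
    // the total time and leave the per-call time essentially unchanged.
    std::printf("N:  %.1f ns/call\n", ns_per_call(100000));
    std::printf("2N: %.1f ns/call\n", ns_per_call(200000));
}
```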


For C / C++, see also Simple for() loop benchmark takes the same time with any loop bound where I went into some more detail about microbenchmarking and using volatile or asm to stop important work from optimizing away with gcc/clang.
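As a rough illustration of that volatile / inline-asm trick with gcc/clang: the helper below is a simplified, int-only cousin of what Google Benchmark's DoNotOptimize does, and its name (and square) are made up for this sketch.

```cpp
#include <chrono>
#include <cstdio>

// Empty GNU-style asm statement: tells gcc/clang the value is read and may be modified,
// so the computation producing it can't be deleted or hoisted out of the loop.
// (Simplified for a register-sized int; Google Benchmark's DoNotOptimize is the general version.)
inline void do_not_optimize(int& value) {
    asm volatile("" : "+r"(value) : : "memory");
}

static int square(int x) { return x * x; }     // hypothetical function under test

int main(int argc, char**) {
    int x = argc + 41;                         // derived from argc so it can't be constant-propagated
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 1000000; ++i) {
        int r = square(x);
        do_not_optimize(r);                    // the result must exist on every iteration
    }
    auto t1 = std::chrono::steady_clock::now();
    std::printf("%.2f ms for 1M calls\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());
}
```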
