Which is the most reliable profiling tool, gprof or kcachegrind?


Question

Profiling some C++ number crunching code with both gprof and kcachegrind gives similar results for the functions that contribute most to the execution time (50-80% depending on input), but for functions between 10-30% the two tools give different results. Does that mean one of them is not reliable? What would you do here?



Solution

gprof is actually quite primitive. Here's what it does:

1) It samples the program counter at a constant rate and records how many samples land in each function (exclusive time).

2) It counts how many times any function A calls any function B.

From that it can find out how many times each function was called in total, and what its average exclusive time was. To get the average inclusive time of each function, it propagates exclusive time upward in the call graph.
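To make the bookkeeping concrete, here is a toy sketch of that arithmetic. The functions (`main`, `A`, `B`), sample counts, and call counts are all made-up numbers for illustration, not real gprof output:

```python
# Toy model of gprof-style bookkeeping (hypothetical numbers throughout).

# Exclusive samples per function, from program-counter sampling at 100 Hz
# (so each sample represents roughly 10 ms):
exclusive = {"main": 10, "A": 30, "B": 60}

# Call counts from instrumentation: (caller, callee) -> times called.
calls = {("main", "A"): 1, ("A", "B"): 2}

# Average exclusive time per call, assuming every call to a function costs
# the same -- one of the suspect premises discussed below.
calls_to = {}
for (_caller, callee), n in calls.items():
    calls_to[callee] = calls_to.get(callee, 0) + n
avg_exclusive_ms = {f: exclusive[f] * 10 / calls_to[f] for f in calls_to}

# Inclusive time: propagate callee totals upward through the (acyclic)
# call graph, deepest callees first.
inclusive = dict(exclusive)
for (caller, callee), _n in reversed(list(calls.items())):
    inclusive[caller] += inclusive[callee]

print(avg_exclusive_ms)  # {'A': 300.0, 'B': 300.0}
print(inclusive)         # {'main': 100, 'A': 90, 'B': 60}
```

Note that the propagation step only works cleanly because the toy call graph has no recursion; with a cycle, gprof's scheme breaks down, which is one of the issues listed next.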

If you're expecting this to have some kind of accuracy, you should be aware of some issues. First, it only counts CPU-time-in-process, meaning it is blind to I/O or other system calls. Second, recursion confuses it. Third, the premise that functions always adhere to an average run time, no matter when they are called or who calls them, is very suspect. Fourth, the notion that functions (and their call graph) are what you need to know about, rather than lines of code, is simply a popular assumption, nothing more. Fifth, the notion that accuracy of measurement is even relevant to finding "bottlenecks" is also just a popular assumption, nothing more.

Callgrind can work at the level of lines - that's good. Unfortunately it shares the other problems.

If your goal is to find "bottlenecks" (as opposed to getting general measurements), you should take a look at wall-clock time stack samplers that report percent-by-line, such as Zoom. The reason is simple but possibly unfamiliar.

Suppose you have a program with a bunch of functions calling each other that takes a total of 10 seconds. Also, there is a sampler that samples, not just the program counter, but the entire call stack, and it does it all the time at a constant rate, like 100 times per second. (Ignore other processes for now.)

So at the end you have 1000 samples of the call stack. Pick any line of code L that appears on more than one of them. Suppose you could somehow optimize that line, by avoiding it, removing it, or passing it off to a really really fast processor.

What would happen to those samples?

Since that line of code L now takes (essentially) no time at all, no sample can hit it, so those samples would just disappear, reducing the total number of samples, and therefore the total time! In fact the overall time would be reduced by the fraction of time L had been on the stack, which is roughly the fraction of samples that contained it.
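That claim can be checked with a quick Monte-Carlo sketch. The fraction F, the sampling rate, and the 10-second duration below are made-up illustration values, matching the scenario above:

```python
import random

# If line L is on the stack some fraction F of a 10-second run, removing L
# shrinks the run by about F * 10 seconds, and the fraction of stack
# samples containing L estimates F.
random.seed(0)

TOTAL_SECONDS = 10.0
RATE = 100          # stack samples per second -> 1000 samples
F = 0.40            # true fraction of wall time L spends on the stack

# Each sample independently "contains L" with probability F.
samples = [random.random() < F for _ in range(int(TOTAL_SECONDS * RATE))]

hits = sum(samples)
estimated_savings = hits / len(samples) * TOTAL_SECONDS
print(f"{hits}/{len(samples)} samples contain L; "
      f"optimizing it away should save ~{estimated_savings:.1f} s")
```

With 1000 samples the estimate lands close to the true 4 seconds, but as the next paragraph argues, even a handful of samples is enough to point at L.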

I don't want to get too statistical, but many people think you need a lot of samples, because they think accuracy of measurement is important. It isn't, if the reason you're doing this is to find out what to fix to get speedup. The emphasis is on finding what to fix, not on measuring it.

Line L is on the stack some fraction F of the time, right? So each sample has a probability F of hitting it, right? Just like flipping a coin. There is a theory of this, called the Rule of Succession. It says that (under simplifying but general assumptions), if you flip a coin N times, and see "heads" S times, you can estimate the fairness of the coin F as (on average) (S+1)/(N+2). So, if you take as few as three samples, and see L on two of them, do you know what F is? Of course not. But you do know on average it is (2+1)/(3+2) or 60%. So that's how much time you could save (on average) by "optimizing away" line L. And, of course, the stack samples showed you exactly where line L (the "bottleneck"**) is. Did it really matter that you didn't measure it to two or three decimal places?
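The Rule of Succession arithmetic above is small enough to spell out directly (the three-samples-two-hits scenario is the one from the text):

```python
from fractions import Fraction

def rule_of_succession(heads, flips):
    """Expected coin bias after `flips` trials with `heads` successes,
    under a uniform prior: (S + 1) / (N + 2)."""
    return Fraction(heads + 1, flips + 2)

# Three stack samples, line L on two of them:
est = rule_of_succession(2, 3)
print(est, "=", float(est))  # 3/5 = 0.6
```

So two hits out of three samples already suggest an expected saving of about 60% of the run time, without measuring anything precisely.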

BTW, this approach is immune to all the other problems mentioned above.

**I keep putting quotes around "bottleneck" because what makes most software slow has nothing in common with the neck of a bottle. A better metaphor is a "drain" - something that just needlessly wastes time.
