How can I profile C++ code running on Linux?


Problem description

I have a C++ application, running on Linux, which I'm in the process of optimizing. How can I pinpoint which areas of my code are running slowly?

Solution

If your goal is to use a profiler, use one of the suggested ones.

However, if you're in a hurry and you can manually interrupt your program under the debugger while it's being subjectively slow, there's a simple way to find performance problems.

Just halt it several times, and each time look at the call stack. If there is some code that is wasting some percentage of the time, 20% or 50% or whatever, that is the probability that you will catch it in the act on each sample. So, that is roughly the percentage of samples on which you will see it. There is no educated guesswork required. If you do have a guess as to what the problem is, this will prove or disprove it.
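For concreteness, here is a minimal sketch of one way to take such samples with gdb on Linux (the binary name myapp is a placeholder; any build with debug symbols, e.g. compiled with -g, will do):

    $ gdb ./myapp
    (gdb) run
    ^C                           # press Ctrl+C while the program feels slow
    (gdb) bt                     # print the call stack of this sample
    (gdb) thread apply all bt    # variant for multi-threaded programs
    (gdb) continue
    ^C                           # take another sample; repeat 5-10 times

Whatever call shows up on a large fraction of those stacks is your suspect.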

You may have multiple performance problems of different sizes. If you clean out any one of them, the remaining ones will take a larger percentage, and be easier to spot, on subsequent passes. This magnification effect, when compounded over multiple problems, can lead to truly massive speedup factors.
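To put made-up numbers on that: suppose a 100-second run has one problem costing 50 seconds and another costing 25. Fixing only the first gives 100/50 = 2x. But in the 50 seconds that remain, the second problem is now 50% of the run instead of 25%, so it appears on twice the fraction of samples; fixing it as well leaves 25 seconds, a combined 4x speedup.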

Caveat: Programmers tend to be skeptical of this technique unless they've used it themselves. They will say that profilers give you this information, but that is only true if they sample the entire call stack, and then let you examine a random set of samples. (The summaries are where the insight is lost.) Call graphs don't give you the same information, because

  1. They don't summarize at the instruction level, and
  2. They give confusing summaries in the presence of recursion.

They will also say it only works on toy programs, when actually it works on any program, and it seems to work better on bigger programs, because they tend to have more problems to find. They will say it sometimes finds things that aren't problems, but that is only true if you see something once. If you see a problem on more than one sample, it is real.

P.S. This can also be done on multi-thread programs if there is a way to collect call-stack samples of the thread pool at a point in time, as there is in Java.

P.P.S. As a rough generality, the more layers of abstraction you have in your software, the more likely you are to find that it is the cause of performance problems (and the opportunity to get speedup).

Added: It might not be obvious, but the stack sampling technique works equally well in the presence of recursion. The reason is that the time that would be saved by removal of an instruction is approximated by the fraction of samples containing it, regardless of the number of times it may occur within a sample.
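For example, if one sample's stack is main → g → f → f → f (a hypothetical recursive chain), that counts as one sample containing f, not three; removing f's work would save that sample's slice of time once, which is exactly what counting samples, rather than stack entries, measures.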

Another objection I often hear is: "It will stop someplace random, and it will miss the real problem". This comes from having a prior concept of what the real problem is. A key property of performance problems is that they defy expectations. Sampling tells you something is a problem, and your first reaction is disbelief. That is natural, but you can be sure if it finds a problem it is real, and vice-versa.

Added: Let me make a Bayesian explanation of how it works. Suppose there is some instruction I (call or otherwise) which is on the call stack some fraction f of the time (and thus costs that much). For simplicity, suppose we don't know what f is, but assume it is either 0.1, 0.2, 0.3, ... 0.9, 1.0, and the prior probability of each of these possibilities is 0.1, so all of these costs are equally likely a-priori.

Then suppose we take just 2 stack samples, and we see instruction I on both samples, designated observation o=2/2. This gives us new estimates of the frequency f of I, according to this:

Prior                                    
P(f=x) x  P(o=2/2|f=x) P(o=2/2&&f=x)  P(o=2/2&&f >= x)  P(f >= x | o=2/2)

0.1    1     1             0.1          0.1            0.25974026
0.1    0.9   0.81          0.081        0.181          0.47012987
0.1    0.8   0.64          0.064        0.245          0.636363636
0.1    0.7   0.49          0.049        0.294          0.763636364
0.1    0.6   0.36          0.036        0.33           0.857142857
0.1    0.5   0.25          0.025        0.355          0.922077922
0.1    0.4   0.16          0.016        0.371          0.963636364
0.1    0.3   0.09          0.009        0.38           0.987012987
0.1    0.2   0.04          0.004        0.384          0.997402597
0.1    0.1   0.01          0.001        0.385          1

                  P(o=2/2) 0.385                

The last column says that, for example, the probability that f >= 0.5 is 92%, up from the prior assumption of 60%.

Suppose the prior assumptions are different. Suppose we assume P(f=0.1) is .991 (nearly certain), and all the other possibilities are almost impossible (0.001). In other words, our prior certainty is that I is cheap. Then we get:

Prior                                    
P(f=x) x  P(o=2/2|f=x) P(o=2/2&& f=x)  P(o=2/2&&f >= x)  P(f >= x | o=2/2)

0.001  1    1              0.001        0.001          0.072727273
0.001  0.9  0.81           0.00081      0.00181        0.131636364
0.001  0.8  0.64           0.00064      0.00245        0.178181818
0.001  0.7  0.49           0.00049      0.00294        0.213818182
0.001  0.6  0.36           0.00036      0.0033         0.24
0.001  0.5  0.25           0.00025      0.00355        0.258181818
0.001  0.4  0.16           0.00016      0.00371        0.269818182
0.001  0.3  0.09           0.00009      0.0038         0.276363636
0.001  0.2  0.04           0.00004      0.00384        0.279272727
0.991  0.1  0.01           0.00991      0.01375        1

                  P(o=2/2) 0.01375                

Now it says P(f >= 0.5) is 26%, up from the prior assumption of 0.6%. So Bayes allows us to update our estimate of the probable cost of I. If the amount of data is small, it doesn't tell us accurately what the cost is, only that it is big enough to be worth fixing.
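If you want to check the arithmetic in these tables, here is a small C++ sketch (not part of the original answer) that recomputes them. The candidate frequencies, the 2-of-2 observation, and the uniform prior are the assumptions from the text; switching the prior to 0.991 at f=0.1 and 0.001 elsewhere reproduces the second table:

    // Recomputes the posterior tables above under stated assumptions.
    #include <cstdio>

    int main() {
        const int kCandidates = 10;
        double f[kCandidates], prior[kCandidates];
        for (int i = 0; i < kCandidates; ++i) {
            f[i] = (10 - i) / 10.0;  // 1.0 down to 0.1
            prior[i] = 0.1;          // uniform prior; edit to try the
                                     // 0.991 / 0.001 prior instead
        }
        const int hits = 2;          // observation o = 2/2
        double joint[kCandidates], total = 0.0;
        for (int i = 0; i < kCandidates; ++i) {
            double likelihood = 1.0; // P(o=2/2 | f=x) = x^2
            for (int k = 0; k < hits; ++k) likelihood *= f[i];
            joint[i] = prior[i] * likelihood;  // P(o && f=x)
            total += joint[i];                 // accumulates P(o)
        }
        double cum = 0.0;
        for (int i = 0; i < kCandidates; ++i) {
            cum += joint[i];         // P(o && f >= x): f sorted descending
            printf("x=%.1f  P(o&&f=x)=%.5f  P(f>=x|o)=%.9f\n",
                   f[i], joint[i], cum / total);  // Bayes: divide by P(o)
        }
        printf("P(o=2/2) = %.5f\n", total);
        return 0;
    }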

Yet another way to look at it is called the Rule Of Succession. If you flip a coin 2 times, and it comes up heads both times, what does that tell you about the probable weighting of the coin? The respected way to answer is to say that it's a Beta distribution, with average value (number of hits + 1) / (number of tries + 2) = (2+1)/(2+2) = 75%.

(The key is that we see I more than once. If we only see it once, that doesn't tell us much except that f > 0.)

So, even a very small number of samples can tell us a lot about the cost of instructions that it sees. (And it will see them with a frequency, on average, proportional to their cost. If n samples are taken, and f is the cost, then I will appear on nf+/-sqrt(nf(1-f)) samples. Example, n=10, f=0.3, that is 3+/-1.4 samples.)


Added: To give an intuitive feel for the difference between measuring and random stack sampling:
There are profilers now that sample the stack, even on wall-clock time, but what comes out is measurements (or hot path, or hot spot, from which a "bottleneck" can easily hide). What they don't show you (and they easily could) is the actual samples themselves. And if your goal is to find the bottleneck, the number of them you need to see is, on average, 2 divided by the fraction of time it takes. So if it takes 30% of time, 2/.3 = 6.7 samples, on average, will show it, and the chance that 20 samples will show it is 99.2%.
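One way to reproduce that last figure (assuming, per the rule of thumb above, that "showing" the bottleneck means it appears on at least 2 samples): with f = 0.3 and n = 20, P(at least 2 of 20) = 1 - (0.7)^20 - 20(0.3)(0.7)^19 ≈ 1 - 0.0008 - 0.0068 ≈ 0.992, i.e. 99.2%.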

Here is an off-the-cuff illustration of the difference between examining measurements and examining stack samples. The bottleneck could be one big blob, or numerous small ones; it makes no difference.

Measurement is horizontal; it tells you what fraction of time specific routines take. Sampling is vertical. If there is any way to avoid what the whole program is doing at that moment, and if you see it on a second sample, you've found the bottleneck. That's what makes the difference - seeing the whole reason for the time being spent, not just how much.
