Capturing function exit time with __gnu_mcount_nc


Question


I'm trying to do some performance profiling on a poorly supported prototype embedded platform.

I note that GCC's -pg flag causes thunks to __gnu_mcount_nc to be inserted on entry to every function. No implementation of __gnu_mcount_nc is available (and the vendor is not interested in assisting), however as it is trivial to write one that simply records the stack frame and current cycle count, I have done so; this works fine and is yielding useful results in terms of caller/callee graphs and most frequently called functions.
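A minimal sketch of such an implementation, assuming an ARM EABI target and a Cortex-M style DWT cycle counter (the register address, buffer layout, and handler name here are illustrative assumptions; the asm stub mirrors the usual __gnu_mcount_nc sequence found in common C libraries). This file itself must be compiled without -pg to avoid recursion:

```c
#include <stdint.h>

#define MAX_EVENTS 4096
/* Assumed Cortex-M DWT cycle counter; substitute your platform's counter. */
#define DWT_CYCCNT (*(volatile uint32_t *)0xE0001004u)

struct mcount_event {
    uint32_t caller;  /* return address into the caller */
    uint32_t callee;  /* address inside the instrumented function */
    uint32_t cycles;  /* cycle count at entry */
};

static struct mcount_event events[MAX_EVENTS];
static volatile uint32_t n_events;

/* C handler: record (caller, callee, entry time) for offline analysis. */
void mcount_internal(uint32_t frompc, uint32_t selfpc)
{
    uint32_t i = n_events;
    if (i < MAX_EVENTS) {
        events[i].caller = frompc;
        events[i].callee = selfpc;
        events[i].cycles = DWT_CYCCNT;
        n_events = i + 1;
    }
}

/* GCC emits "push {lr}; bl __gnu_mcount_nc" in every prologue.  On entry,
 * lr holds the return address into the callee and [sp] holds the callee's
 * saved lr (the real caller).  The final "pop {..., ip, lr}; bx ip"
 * restores lr from that extra stack word and resumes the callee, leaving
 * the stack balanced. */
__attribute__((naked)) void __gnu_mcount_nc(void)
{
    __asm__ volatile(
        "push  {r0, r1, r2, r3, lr}    \n" /* save arg regs and our lr      */
        "ldr   r0, [sp, #20]           \n" /* frompc: lr pushed by prologue */
        "bic   r1, lr, #1              \n" /* selfpc: clear Thumb bit       */
        "bl    mcount_internal         \n"
        "pop   {r0, r1, r2, r3, ip, lr}\n" /* ip <- our lr, lr <- caller's  */
        "bx    ip                      \n" /* resume the callee             */
    );
}
```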

I would really like to obtain information about the time spent in function bodies as well; however, I am having difficulty understanding how to approach this when only the entry to each function, and not the exit, is hooked: you can tell exactly when each function is entered, but without hooking the exit points you cannot know how much of the time that elapses before the next piece of information arrives should be attributed to the callee and how much to the callers.

Nevertheless, the GNU profiling tools are in fact demonstrably able to gather runtime information for functions on many platforms, so presumably the developers have some scheme in mind for achieving this.

I have seen some existing implementations that do things like maintain a shadow callstack and twiddle the return address on entry to __gnu_mcount_nc so that __gnu_mcount_nc will get invoked again when the callee returns; it can then match the caller/callee/sp triplet against the top of the shadow callstack and so distinguish this case from the call on entry, record the exit time and correctly return to the caller.
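A much-simplified, single-threaded sketch of that shadow-callstack scheme, using a dedicated exit trampoline rather than re-entering __gnu_mcount_nc itself (so the caller/callee/sp matching reduces to pairing pushes with pops); hook_entry, mcount_exit_trampoline, and record_exit are all hypothetical names:

```c
#include <stdint.h>

#define SHADOW_DEPTH 256
#define DWT_CYCCNT (*(volatile uint32_t *)0xE0001004u)  /* assumed counter */

struct shadow_frame {
    uint32_t real_lr;   /* where the callee must really return to */
    uint32_t callee;    /* address of the function being timed */
    uint32_t sp;        /* sp at entry (a fuller version matches this on exit) */
    uint32_t t_entry;   /* cycle count at entry */
};

static struct shadow_frame shadow[SHADOW_DEPTH];
static uint32_t shadow_top;

/* Assumed asm stub: the callee "returns" here; it saves r0-r3, calls
 * hook_exit(), and branches to the address hook_exit() returns. */
extern void mcount_exit_trampoline(void);

extern void record_exit(uint32_t callee, uint32_t elapsed);  /* your logger */

/* Called from __gnu_mcount_nc on entry.  Pushes a shadow frame and returns
 * the address to patch into the callee's saved lr: the trampoline. */
uint32_t hook_entry(uint32_t real_lr, uint32_t callee, uint32_t sp)
{
    if (shadow_top >= SHADOW_DEPTH)
        return real_lr;             /* shadow stack full: leave lr unpatched */
    struct shadow_frame *f = &shadow[shadow_top++];
    f->real_lr = real_lr;
    f->callee  = callee;
    f->sp      = sp;
    f->t_entry = DWT_CYCCNT;
    return (uint32_t)&mcount_exit_trampoline;
}

/* Called from the trampoline when the callee returns into it.  Pops the
 * shadow frame, records elapsed cycles, and yields the real return address.
 * Only runs when hook_entry pushed a frame, so no underflow check here. */
uint32_t hook_exit(void)
{
    struct shadow_frame *f = &shadow[--shadow_top];
    record_exit(f->callee, DWT_CYCCNT - f->t_entry);
    return f->real_lr;
}
```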

This approach leaves much to be desired:

  • it seems like it may be brittle in the presence of recursion and libraries compiled without the -pg flag
  • it seems like it would be difficult to implement with low overhead or at all in embedded multithreaded/multicore environments where toolchain TLS support is absent and current thread ID may be expensive/complex to obtain

Is there some obvious better way to implement a __gnu_mcount_nc so that a -pg build is able to capture function exit as well as entry time that I am missing?

Solution

gprof does not use that function for timing of either entry or exit, but for call-counting of function A calling any function B. Rather, it uses the self-time gathered by counting PC samples in each routine, and then uses the function-to-function call counts to estimate how much of that self-time should be charged back to callers.

For example, if A calls C 10 times, and B calls C 20 times, and C has 1000ms of self time (i.e 100 PC samples), then gprof knows C has been called 30 times, and 33 of the samples can be charged to A, while the other 67 can be charged to B. Similarly, sample counts propagate up the call hierarchy.
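In code, that attribution step is just proportional scaling (a sketch of the idea, not gprof's actual implementation):

```c
/* Charge a routine's self-time samples back to one caller in proportion
 * to call counts -- the heuristic described above. */
double charge_to_caller(unsigned calls_from_this_caller,
                        unsigned total_calls,
                        unsigned self_samples)
{
    return (double)self_samples * calls_from_this_caller / total_calls;
}

/* charge_to_caller(10, 30, 100) ~= 33 samples charged to A,
 * charge_to_caller(20, 30, 100) ~= 67 samples charged to B.  */
```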

So you see, it doesn't time function entry and exit. The measurements it does get are very coarse, because it makes no distinction between short calls and long calls. Also, if a PC sample happens during I/O or in a library routine that is not compiled with -pg, it is not counted at all. And, as you noted, it is very brittle in the presence of recursion, and can introduce notable overhead on short functions.

Another approach is stack-sampling, rather than PC-sampling. Granted, it is more expensive to capture a stack sample than a PC-sample, but fewer samples are needed. If, for example, a function, line of code, or any description you want to make, is evident on fraction F out of the total of N samples, then you know that the fraction of time it costs is F, with a standard deviation of sqrt(NF(1-F)). So, for example, if you take 100 samples, and a line of code appears on 50 of them, then you can estimate the line costs 50% of the time, with an uncertainty of sqrt(100*.5*.5) = +/- 5 samples or between 45% and 55%. If you take 100 times as many samples, you can reduce the uncertainty by a factor of 10. (Recursion doesn't matter. If a function or line of code appears 3 times in a single sample, that counts as 1 sample, not 3. Nor does it matter if function calls are short - if they are called enough times to cost a significant fraction, they will be caught.)
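The binomial arithmetic above, as a small standalone helper (not part of any profiler):

```c
#include <math.h>
#include <stdio.h>

/* Fraction of samples showing the item of interest, with one-standard-
 * deviation bounds from the binomial model: stddev (in samples) is
 * sqrt(N*F*(1-F)). */
void estimate_cost(unsigned hits, unsigned total_samples)
{
    double f  = (double)hits / total_samples;
    double sd = sqrt(total_samples * f * (1.0 - f)) / total_samples;
    printf("cost ~ %.0f%%, 1-sigma range %.0f%%..%.0f%%\n",
           100 * f, 100 * (f - sd), 100 * (f + sd));
}

/* estimate_cost(50, 100) -> cost ~ 50%, 1-sigma range 45%..55% */
```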

Notice, when you're looking for things you can fix to get speedup, the exact percent doesn't matter. The important thing is to find it. (In fact, you only need see a problem twice to know it is big enough to fix.)

That's this technique.


P.S. Don't get suckered into call-graphs, hot-paths, or hot-spots. Here's a typical call-graph rat's nest. Yellow is the hot-path, and red is the hot-spot.

And this shows how easy it is for a juicy speedup opportunity to be in none of those places:

The most valuable thing to look at is a dozen or so random raw stack samples, and relating them to the source code. (That means bypassing the back-end of the profiler.)

ADDED: Just to show what I mean, I simulated ten stack samples from the call graph above, and here's what I found

  • 3/10 samples are calling class_exists, one for the purpose of getting the class name, and two for the purpose of setting up a local configuration. class_exists calls autoload which calls requireFile, and two of those call adminpanel. If this can be done more directly, it could save about 30%.
  • 2/10 samples are calling determineId, which calls fetch_the_id which calls getPageAndRootlineWithDomain, which calls three more levels, terminating in sql_fetch_assoc. That seems like a lot of trouble to go through to get an ID, and it's costing about 20% of time, and that's not counting I/O.

So the stack samples don't just tell you how much inclusive time a function or line of code costs, they tell you why it's being done, and what possible silliness it takes to accomplish it. I often see this - galloping generality - swatting flies with hammers, not intentionally, but just following good modular design.

ADDED: Another thing not to get sucked into is flame graphs. For example, here is a flame graph (rotated right 90 degrees) of the ten simulated stack samples from the call graph above. The routines are all numbered, rather than named, but each routine has its own color.
Notice the problem we identified above, with class_exists (routine 219) being on 30% of the samples, is not at all obvious by looking at the flame graph. More samples and different colors would make the graph look more "flame-like", but does not expose routines which take a lot of time by being called many times from different places.

Here's the same data sorted by function rather than by time. That helps a little, but doesn't aggregate similarities called from different places:
Once again, the goal is to find the problems that are hiding from you. Anyone can find the easy stuff, but the problems that are hiding are the ones that make all the difference.

ADDED: Another kind of eye-candy is this one:
where the black-outlined routines could all be the same, just called from different places. The diagram doesn't aggregate them for you. If a routine has high inclusive percent by being called a large number of times from different places, it will not be exposed.
