linux perf: how to interpret and find hotspots
Problem description
I tried out Linux's perf utility today and am having trouble interpreting its results. I'm used to valgrind's callgrind, which is of course a totally different approach from the sampling-based method of perf.
What I did:
perf record -g -p $(pidof someapp)
perf report -g -n
Now I see something like this:
+  16.92%  kdevelop  libsqlite3.so.0.8.6   [.] 0x3fe57
+  10.61%  kdevelop  libQtGui.so.4.7.3     [.] 0x81e344
+   7.09%  kdevelop  libc-2.14.so          [.] 0x85804
+   4.96%  kdevelop  libQtGui.so.4.7.3     [.] 0x265b69
+   3.50%  kdevelop  libQtCore.so.4.7.3    [.] 0x18608d
+   2.68%  kdevelop  libc-2.14.so          [.] memcpy
+   1.15%  kdevelop  [kernel.kallsyms]     [k] copy_user_generic_string
+   0.90%  kdevelop  libQtGui.so.4.7.3     [.] QTransform::translate(double, double)
+   0.88%  kdevelop  libc-2.14.so          [.] __libc_malloc
+   0.85%  kdevelop  libc-2.14.so          [.] memcpy
...
OK, these functions might be slow, but how do I find out where they are being called from? As all these hotspots lie in external libraries, I see no way to optimize my code.
Basically I am looking for some kind of callgraph annotated with accumulated cost, where my functions have a higher inclusive sampling cost than the library functions I call.
Is this possible with perf? If so - how?
Note: I found out that pressing "E" expands the callgraph and gives somewhat more information. But the callgraph is often not deep enough and/or terminates randomly, without showing how much time was spent where. Example:
- 10.26%  kate  libkatepartinterfaces.so.4.6.0  [.] Kate::TextLoader::readLine(int&...
     Kate::TextLoader::readLine(int&, int&)
     Kate::TextBuffer::load(QString const&, bool&, bool&)
     KateBuffer::openFile(QString const&)
     KateDocument::openFile()
     0x7fe37a81121c
Could it be an issue that I'm running on 64 bit? See also: http://lists.fedoraproject.org/pipermail/devel/2010-November/144952.html (I'm not using Fedora, but it seems to apply to all 64-bit systems).
Solution
With Linux 3.7, perf is finally able to use DWARF information to generate the callgraph:
perf record --call-graph dwarf -- yourapp
perf report -g graph --no-children
Neat, but the curses GUI is horrible compared to VTune, KCacheGrind or similar tools. I recommend trying out FlameGraphs instead, which is a pretty neat visualization: http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
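For reference, a typical way to turn a perf recording into a flame graph is sketched below. It assumes the stackcollapse-perf.pl and flamegraph.pl scripts from Brendan Gregg's FlameGraph repository are available locally. The FLAMEGRAPH_DIR variable, the make_flamegraph function name, and the output file names are made up for this example and are not part of perf or FlameGraph.

```shell
#!/bin/sh
# Assumption: the FlameGraph scripts live in $FLAMEGRAPH_DIR
# (hypothetical variable; point it at your own checkout).
FLAMEGRAPH_DIR="${FLAMEGRAPH_DIR:-$HOME/FlameGraph}"

# Record a command with DWARF call graphs and render flame.svg.
# Usage: make_flamegraph yourapp [args...]
make_flamegraph() {
    # Sample the command, unwinding stacks via DWARF debug information.
    perf record --call-graph dwarf -o perf.data -- "$@"
    # Stop here if no samples were written.
    [ -s perf.data ] || return 1
    # Dump the samples, fold each stack into one line, render the SVG.
    perf script -i perf.data \
        | "$FLAMEGRAPH_DIR/stackcollapse-perf.pl" \
        | "$FLAMEGRAPH_DIR/flamegraph.pl" > flame.svg
}
```

Calling make_flamegraph yourapp should then leave a flame.svg in the current directory that can be opened in a web browser.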
Note: In the report step, -g graph makes the output show easy-to-understand "relative to total" percentages, rather than "relative to parent" numbers. --no-children shows only self cost, rather than inclusive cost, a feature that I also find invaluable.
If you have a recent perf and an Intel CPU, also try out the LBR unwinder, which has much better performance and produces far smaller result files:
perf record --call-graph lbr -- yourapp
The downside here is that the call stack depth is more limited compared to the default DWARF unwinder configuration.
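To make the difference between self cost and inclusive cost from the note above concrete, here is a small sketch that computes both from call-stack samples. The input uses the "folded" format of the FlameGraph tooling: one call chain per line, frames separated by semicolons, followed by a sample count. The function names, the counts, and the cost_table helper are all made up for illustration. This is not how perf computes its numbers internally.

```shell
#!/bin/sh
# Hypothetical folded-stack samples: "root;...;leaf <sample count>".
samples='main;load_file;read_line;memcpy 60
main;load_file;read_line 25
main;render;memcpy 10
main;render 5'

# Print "<function> self=<n> inclusive=<n>" for every function on stdin.
cost_table() {
    awk '
    {
        count = $NF                     # last field is the sample count
        chain = $0
        sub(/ [0-9]+$/, "", chain)      # strip the count, keep the call chain
        n = split(chain, frames, ";")
        self[frames[n]] += count        # only the leaf frame gets self cost
        split("", seen)                 # portable way to clear an array
        for (i = 1; i <= n; i++) {
            if (!(frames[i] in seen)) { # count recursive frames once per stack
                incl[frames[i]] += count
                seen[frames[i]] = 1
            }
        }
    }
    END {
        for (f in incl)
            printf "%s self=%d inclusive=%d\n", f, self[f], incl[f]
    }' | sort
}

printf '%s\n' "$samples" | cost_table
```

With these made-up samples, memcpy has the highest self cost (70 of 100 samples), while main and load_file have no self cost but the highest inclusive cost (100 and 85). A self-cost view therefore points at the library function, while an inclusive view points at the callers in your own code.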