Profiling my program with Linux perf and different call graph modes gives different results


Problem description

I want to profile my C++ program with Linux perf. For this I used the following three commands, and I do not understand why I get three completely different reports.

perf record --call-graph dwarf ./myProg
perf report

perf record --call-graph fp ./myProg
perf report

perf record --call-graph lbr ./myProg
perf report

I also do not understand why the main function is not the highest function in the list.

The logic of my program is as follows: the main function calls getPogDocumentFromFile, which calls fromPoxml, which calls toPred, which calls applySubst, which calls subst. Moreover, toPred, applySubst and subst are recursive functions, and I expect them to be the bottleneck.

Some more comments: my program runs for about 25 minutes, it is highly recursive, and it allocates a lot of memory (~17 GB). I also compile with -fno-omit-frame-pointer and use a recent Intel CPU.

Any ideas?

EDIT:

Thinking again about my question, I realize that I do not understand the meaning of the Children column.

So far I assumed that the Self column was the percentage of samples in which the function we are looking at is at the top of the call stack, and that the Children column was the percentage of samples in which the function appears anywhere in the call stack. Obviously this is not the case, otherwise the main function would have its Children column not far from 100%. Maybe the call stack is truncated? Or am I completely misunderstanding how profilers work?

Solution

The man page of perf report documents how call chains are displayed with children accumulation:

  --children
       Accumulate callchain of children to parent entry so that then can
       show up in the output. The output will have a new "Children"
       column and will be sorted on the data. It requires callchains are
       recorded. See the ‘overhead calculation’ section for more
       details. Enabled by default, disable with --no-children.

I recommend trying the non-default mode with the --no-children option of perf report (or perf top -g --no-children -p $PID_OF_PROGRAM).

So in the default mode, when the perf.data file contains call chain data, perf report calculates both the "self" and the "self + children" overhead and sorts on the accumulated value. For example, if a function f1() has 10% of the "self" samples and calls a leaf function f2() that has 20% of the "self" samples (with f2() reached only through f1()), then the Children value of f1() is 30%. The accumulated value covers every stack in which the function appears: the work done in the function itself, plus the work done in all of its direct and indirect children (descendants).
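The accumulation described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not perf's actual implementation: the function name self_and_children and the sample data are hypothetical, and the sketch assumes each function is counted at most once per sample, so recursive frames do not inflate the Children value.

```python
from collections import Counter


def self_and_children(stacks):
    """Return {function: (self %, children %)} for a list of sampled stacks.

    Each stack is a list of function names ordered leaf first, root last.
    Assumption: a function is counted once per sample for the Children
    value, even if recursion makes it appear several times in the stack.
    """
    total = len(stacks)
    self_hits = Counter()
    cumulative_hits = Counter()
    for stack in stacks:
        if not stack:
            continue
        self_hits[stack[0]] += 1      # the leaf frame gets the "self" sample
        for func in set(stack):       # every distinct frame gets "children"
            cumulative_hits[func] += 1
    return {
        func: (100.0 * self_hits[func] / total,
               100.0 * cumulative_hits[func] / total)
        for func in cumulative_hits
    }


# Hypothetical samples: 10 in f1 itself, 20 in f2 called from f1,
# and 70 in some other function called from main.
samples = ([["f1", "main"]] * 10
           + [["f2", "f1", "main"]] * 20
           + [["other", "main"]] * 70)
report = self_and_children(samples)
print(report["f1"])    # (10.0, 30.0)
print(report["main"])  # (0.0, 100.0)
```

With complete stacks, main ends up at 100% in the second column. If some of the stacks lost their last frames during recording, main would be missing from those samples and its Children value would drop accordingly.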

You can choose the call stack sampling method with the --call-graph option (dwarf / lbr / fp), and each method has limitations. Sometimes a method (especially fp) fails to extract part of the call stack. The -fno-omit-frame-pointer option helps, but if it is used for your executable and not for a library that calls back into your code, the call stack will be extracted only partially. Very long call chains may also be cut short: dwarf copies only a fixed-size chunk of the stack with each sample (8 KB by default), and lbr is limited by the number of hardware branch record entries. perf report may also fail to handle some cases.

To check for truncated call chains, run perf script | less and look at samples somewhere in the middle of the output. In this mode perf prints every recorded sample with all detected function names. Samples whose chain does not end with main and __libc_start_main are truncated.
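Scanning the output by eye gets tedious for a 25-minute run, so the same check can be automated. The following is a rough sketch under stated assumptions: it assumes the usual perf script text layout (one unindented header line per sample, then one indented line per frame of the form "address symbol+offset (dso)", then a blank line), and the names split_samples, count_truncated and ROOTS are made up for this example. It takes only the first token of the symbol, which is enough to recognize simple root names such as main.

```python
ROOTS = ("main", "__libc_start_main", "start_thread")


def split_samples(lines):
    """Yield one list of symbol names (leaf first) per recorded sample."""
    frames = None
    for line in lines:
        if not line.strip():              # blank line closes a sample
            if frames is not None:
                yield frames
            frames = None
        elif line[0] in " \t":            # indented line: one stack frame
            parts = line.split()
            if frames is not None and len(parts) >= 2:
                # "subst+0x10" -> "subst"
                frames.append(parts[1].split("+")[0])
        else:                             # unindented line: sample header
            if frames is not None:
                yield frames
            frames = []
    if frames is not None:
        yield frames


def count_truncated(lines, roots=ROOTS):
    """Return (total samples, samples whose chain reaches none of roots)."""
    total = truncated = 0
    for frames in split_samples(lines):
        total += 1
        if not any(name in roots for name in frames):
            truncated += 1
    return total, truncated


# Hypothetical excerpt in the assumed layout: the second sample is cut short.
demo = (
    "myProg 1234 100.0: 1 cycles:\n"
    "\t    401000 subst+0x10 (/home/u/myProg)\n"
    "\t    401100 applySubst+0x20 (/home/u/myProg)\n"
    "\t    401200 main+0x30 (/home/u/myProg)\n"
    "\n"
    "myProg 1234 100.1: 1 cycles:\n"
    "\t    401000 subst+0x10 (/home/u/myProg)\n"
    "\t    401000 subst+0x40 (/home/u/myProg)\n"
)
print(count_truncated(demo.splitlines(keepends=True)))  # (2, 1)
```

To use it on real data, save the output of perf script to a file and pass the open file object to count_truncated. A high share of truncated samples suggests that the chosen --call-graph method is losing the outer frames.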

"otherwise the main function would have its children column not far from 100%"

Yes, for a single-threaded program with correctly recorded and processed call stacks, main should have something like 99% in the "Children" column. In a multithreaded program, the second and subsequent threads have a different root node, such as start_thread.
