KCachegrind output for optimized vs unoptimized builds


Question

I run valgrind --tool=callgrind ./executable on the executable file generated by the following code:

#include <cstdlib>
#include <stdio.h>
using namespace std;

class XYZ{
public:
    int Count() const {return count;}
    void Count(int val){count = val;}
private:
    int count;
};

int main() {
    XYZ xyz;
    xyz.Count(10000);
    int sum = 0;
    for(int i = 0; i < xyz.Count(); i++){
//My interest is to see how the compiler optimizes the xyz.Count() call
        sum += i;
    }
    printf("Sum is %d\n", sum);
    return 0;
}

I make the debug build with the following options: -fPIC -fno-strict-aliasing -fexceptions -g -std=c++14. The release build uses the same options plus -O2: -fPIC -fno-strict-aliasing -fexceptions -g -O2 -std=c++14.

Running valgrind on each executable generates two dump files, one for the debug executable and one for the release executable. When these files are viewed in KCachegrind, the call graph for the debug build is easy to understand.

As expected, the function XYZ::Count() const is called 10001 times. However, the optimized release build is much harder to decipher, and it is not clear how many times the function is called, if at all. I am aware that the function call might be inlined. But how does one figure out that it has in fact been inlined?

In the call graph for the release build, there seems to be no indication of the function XYZ::Count() const at all under main().

My questions are:

(1) Without looking at the assembly code generated for the debug/release builds, and using only KCachegrind, how can one figure out how many times a particular function (in this case XYZ::Count() const) is called? In the release build call graph, the function does not appear to be called even once.

(2) Is there a way to understand the call graph and other details provided by KCachegrind for release/optimized builds? I have already looked at the KCachegrind manual available at https://docs.kde.org/trunk5/en/kdesdk/kcachegrind/kcachegrind.pdf, but I was wondering if there are some useful hacks/rules of thumb that one should look for in release builds.

Recommended answer

The output of valgrind is easy to understand: as valgrind+KCachegrind are telling you, this function was not called at all in the release build.

The question is, what do you mean by "called"? If a function is inlined, is it still "called"? Actually, the situation is more complex than it seems at first sight, and your example isn't that trivial.

Was Count() inlined in the release build? Sure, kind of. The code transformations performed during optimization are often quite remarkable, as in your case, and the best way to judge is to look at the resulting assembly (here for clang):

main:                                   # @main
        pushq   %rax
        leaq    .L.str(%rip), %rdi
        movl    $49995000, %esi         # imm = 0x2FADCF8
        xorl    %eax, %eax
        callq   printf@PLT
        xorl    %eax, %eax
        popq    %rcx
        retq
.L.str:
        .asciz  "Sum is %d\n"

You can see that main doesn't execute the for-loop at all. It just prints the result (49995000), which is computed during optimization because the number of iterations is known at compile time.

So was Count() inlined? Yes, somewhere in the first steps of optimization, but then the code became something completely different: there is no place in the final assembly where Count() was inlined.
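To make the constant folding concrete, here is a minimal sketch (assuming a C++14 compiler) that forces the same sum to be evaluated at compile time. The helper sum_below is hypothetical and not part of the original program; the optimizer does something comparable on its own, without being asked, when the loop bound is a known constant.

```cpp
// Sketch only: sum_below is a hypothetical helper, not from the question.
// It repeats the loop from main() in a form the compiler may evaluate
// at compile time (C++14 relaxed constexpr).
constexpr int sum_below(int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += i;
    }
    return sum;
}

// This line only compiles if the compiler can run the loop itself and
// arrives at the constant seen in the assembly (movl $49995000, %esi).
static_assert(sum_below(10000) == 49995000, "0 + 1 + ... + 9999");
```

Since the value exists before the program ever runs, there is nothing left for callgrind to count at run time.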

So what happens when we "hide" the number of iterations from the compiler? For example, by passing it via the command line:

...
int main(int argc,  char* argv[]) {
   XYZ xyz;
   xyz.Count(atoi(argv[1]));
...

In the resulting assembly we still don't encounter a for-loop, because the optimizer can figure out that the call to Count() has no side effects, and it optimizes the whole loop away:

main:                                   # @main
        pushq   %rbx
        movq    8(%rsi), %rdi
        xorl    %ebx, %ebx
        xorl    %esi, %esi
        movl    $10, %edx
        callq   strtol@PLT
        testl   %eax, %eax
        jle     .LBB0_2
        leal    -1(%rax), %ecx
        leal    -2(%rax), %edx
        imulq   %rcx, %rdx
        shrq    %rdx
        leal    -1(%rax,%rdx), %ebx
.LBB0_2:
        leaq    .L.str(%rip), %rdi
        xorl    %eax, %eax
        movl    %ebx, %esi
        callq   printf@PLT
        xorl    %eax, %eax
        popq    %rbx
        retq
.L.str:
        .asciz  "Sum is %d\n"

The optimizer replaced the loop with a closed form: for n > 0 it computes (n-1)*(n-2)/2 + (n-1), which equals n*(n-1)/2, the sum of i for i = 0..n-1!
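Reading the assembly line by line, the value that ends up in %ebx is (n-1)*(n-2)/2 + (n-1), which simplifies to n*(n-1)/2. The sketch below compares that expression with the plain loop. The function names are made up for illustration, and 64-bit arithmetic is used for simplicity, so it does not model the 32-bit wrap-around of the real instructions.

```cpp
#include <cassert>

// Reference: the loop from the question, for an arbitrary bound n.
long long sum_by_loop(int n) {
    long long sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += i;
    }
    return sum;
}

// Closed form mirroring the clang output (hypothetical helper name).
long long sum_closed_form(int n) {
    if (n <= 0) return 0;        // testl %eax,%eax / jle .LBB0_2
    const long long a = n - 1;   // leal -1(%rax), %ecx
    const long long b = n - 2;   // leal -2(%rax), %edx
    return a * b / 2 + a;        // imulq, shrq, leal -1(%rax,%rdx), %ebx
}

// Returns true if both versions agree for every n in [-3, limit].
bool closed_form_matches(int limit) {
    for (int n = -3; n <= limit; ++n) {
        if (sum_by_loop(n) != sum_closed_form(n)) return false;
    }
    return true;
}
```

The product (n-1)*(n-2) is always even, so the shift by one bit loses nothing.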

Let's now hide the definition of Count() in a separate translation unit, class.cpp, so the optimizer cannot see its definition:

class XYZ{
public:
    int Count() const;//definition in separate translation unit
...

Now we get our for-loop and a call to Count() in every iteration. The most important part of the assembly is:

.L6:
        addl    %ebx, %ebp
        addl    $1, %ebx
.L3:
        movq    %r12, %rdi
        call    XYZ::Count() const@PLT
        cmpl    %eax, %ebx
        jl      .L6

The result of Count() (in %eax) is compared with the current counter (in %ebx) at every iteration step. If we now run it under valgrind, the list of callees shows that XYZ::Count() was called 10001 times.
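Why 10001 and not 10000? The loop condition is evaluated once for each of the 10000 iterations and one more time when it finally fails. A minimal sketch that counts this at the source level follows; CountedXYZ and count_calls are hypothetical names introduced only for this illustration, and the counter is a visible side effect, so the calls cannot simply be dropped.

```cpp
#include <cassert>

// Hypothetical instrumented variant of XYZ, for illustration only.
class CountedXYZ {
public:
    int Count() const { ++calls; return count; }  // record every call
    void Count(int val) { count = val; }
    static long calls;                            // shared call counter
private:
    int count = 0;
};
long CountedXYZ::calls = 0;

// Runs the loop from the question and returns how often Count() const ran.
long count_calls(int n) {
    CountedXYZ::calls = 0;
    CountedXYZ xyz;
    xyz.Count(n);
    int sum = 0;
    for (int i = 0; i < xyz.Count(); i++) {
        sum += i;
    }
    (void)sum;  // the sum itself is not of interest here
    return CountedXYZ::calls;
}
```

With n = 10000 the counter ends at 10001, matching what callgrind reports when every call is actually executed.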

However, for modern toolchains it is not enough to look at the assembly of the individual translation units: there is a thing called link-time optimization (LTO). We can use it by building along these lines:

gcc -fPIC -g -O2 -flto -o class.o -c class.cpp
gcc -fPIC -g -O2 -flto -o test.o  -c test.cpp
gcc -g -O2 -flto -o test_r class.o test.o

Running the resulting executable under valgrind, we once again see that Count() was not called!

However, looking at the machine code (here I used gcc; my clang installation seems to have an issue with LTO):

00000000004004a0 <main>:
  4004a0:   48 83 ec 08             sub    $0x8,%rsp
  4004a4:   48 8b 7e 08             mov    0x8(%rsi),%rdi
  4004a8:   ba 0a 00 00 00          mov    $0xa,%edx
  4004ad:   31 f6                   xor    %esi,%esi
  4004af:   e8 bc ff ff ff          callq  400470 <strtol@plt>
  4004b4:   85 c0                   test   %eax,%eax
  4004b6:   7e 2b                   jle    4004e3 <main+0x43>
  4004b8:   89 c1                   mov    %eax,%ecx
  4004ba:   31 d2                   xor    %edx,%edx
  4004bc:   31 c0                   xor    %eax,%eax
  4004be:   66 90                   xchg   %ax,%ax
  4004c0:   01 c2                   add    %eax,%edx
  4004c2:   83 c0 01                add    $0x1,%eax
  4004c5:   39 c8                   cmp    %ecx,%eax
  4004c7:   75 f7                   jne    4004c0 <main+0x20>
  4004c9:   48 8d 35 a4 01 00 00    lea    0x1a4(%rip),%rsi        # 400674 <_IO_stdin_used+0x4>
  4004d0:   bf 01 00 00 00          mov    $0x1,%edi
  4004d5:   31 c0                   xor    %eax,%eax
  4004d7:   e8 a4 ff ff ff          callq  400480 <__printf_chk@plt>
  4004dc:   31 c0                   xor    %eax,%eax
  4004de:   48 83 c4 08             add    $0x8,%rsp
  4004e2:   c3                      retq   
  4004e3:   31 d2                   xor    %edx,%edx
  4004e5:   eb e2                   jmp    4004c9 <main+0x29>
  4004e7:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)

We can see that the call to Count() was inlined, but there is still a for-loop (I guess this is a gcc vs. clang thing).

But what is of most interest to you: Count() is "called" only once. Its value is saved in register %ecx, and the loop itself is just:

  4004c0:   01 c2                   add    %eax,%edx
  4004c2:   83 c0 01                add    $0x1,%eax
  4004c5:   39 c8                   cmp    %ecx,%eax
  4004c7:   75 f7                   jne    4004c0 <main+0x20>
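Written back as C++, the inlined and hoisted loop looks roughly like the sketch below. This is a hand-made reconstruction under the assumption that the register roles are as annotated (%ecx holds the bound, %eax the counter, %edx the sum); sum_hoisted and sum_original are hypothetical names, not compiler output.

```cpp
#include <cassert>

// Minimal stand-in for the class from the question.
class XYZ {
public:
    int Count() const { return count; }
    void Count(int val) { count = val; }
private:
    int count = 0;
};

// Source-level picture of the optimized loop.
int sum_hoisted(const XYZ& xyz) {
    const int n = xyz.Count();  // read once, then kept in %ecx
    int sum = 0;                // %edx
    if (n > 0) {                // test %eax,%eax / jle
        int i = 0;              // %eax
        do {
            sum += i;           // add %eax,%edx
            i += 1;             // add $0x1,%eax
        } while (i != n);       // cmp %ecx,%eax / jne
    }
    return sum;
}

// The loop as written in the question, for comparison.
int sum_original(const XYZ& xyz) {
    int sum = 0;
    for (int i = 0; i < xyz.Count(); i++) {
        sum += i;
    }
    return sum;
}
```

Both functions return the same value. The difference is only in how often the member is read, which is exactly the detail a call count can no longer show once the call is inlined.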

You could also see all of this in KCachegrind, if valgrind is run with the option --dump-instr=yes.
