Does omitting frame pointers really have a positive effect on performance and a negative effect on debuggability?


Question

As was advised a long time ago, I always build my release executables without frame pointers (which is the default if you compile with /Ox).

However, I have now read in the paper http://research.microsoft.com/apps/pubs/default.aspx?id=81176 that frame pointers don't have much of an effect on performance. So optimizing fully (using /Ox) or optimizing fully with frame pointers (using /Ox /Oy-) doesn't really make a difference to performance.

Microsoft seems to indicate that adding frame pointers (/Oy-) makes debugging easier, but is this really the case?

I did some experiments and noticed that:

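  • In a simple test executable (compiled with /Ox /Ob0), omitting frame pointers does improve performance (by about 10%). But this test executable only performs a few function calls and does nothing else.
  • In my own application, adding or removing frame pointers doesn't seem to have a big effect. Adding frame pointers seems to make the application about 5% faster, but that could be within the margin of error.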

What is the general advice regarding frame pointers?

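  • Should they be omitted (/Ox) in a release executable because they really have a positive effect on performance?
  • Should they be added (/Ox /Oy-) in a release executable because they improve debuggability (when debugging with a crash-dump file)?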

Using Visual Studio 2010.

Recommended answer

Short answer: By omitting the frame pointer,

You need to use the stack pointer to access local variables and arguments. The compiler doesn't mind, but if you are coding in assembler, this makes your life slightly harder. Much harder if you don't use macros.

You save four bytes (32-bit architecture) of stack space per function call. Unless you are using deep recursion, this isn't a win.

You save a memory write to cached memory (the stack) and you (theoretically) save a few clock ticks on function entry/exit, but you can increase the code size. Unless your function is doing very little very often (in which case it should be inlined), this shouldn't be noticeable.

You free up a general purpose register. If the compiler can utilize the register, it will produce code that is both substantially smaller and potentially faster. But if most of the CPU time is spent talking to main memory (or even the hard drive), omitting the frame pointer is not going to save you from that.

The debugger will lose an easy way to generate the stack trace. The debugger might still be able to generate the stack trace from a different source (such as a PDB file).
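To put the stack-space point in numbers, here is a minimal sketch of the arithmetic. The function name `frame_pointer_stack_bytes` is made up for illustration, and the only assumption is the one stated above: each call saves exactly one pointer-sized slot when the frame pointer is not pushed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of any real API): stack bytes saved by not
   pushing a frame pointer, assuming one pointer-sized slot per nested call. */
size_t frame_pointer_stack_bytes(size_t call_depth, size_t pointer_size)
{
    /* 4 bytes on 32-bit x86, 8 bytes on a 64-bit target */
    assert(pointer_size == 4 || pointer_size == 8);
    return call_depth * pointer_size;
}
```

For example, `frame_pointer_stack_bytes(100000, 4)` gives 400000 bytes. That only becomes interesting at recursion depths where the stack is already close to its limit (the MSVC linker reserves 1 MB per thread by default), which is why the saving rarely matters in ordinary code.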

Long answer:

The typical function entry and exit is:

PUSH FP   ;push the caller's frame pointer
MOV FP,SP ;store the stack pointer in the frame pointer
SUB SP,xx ;allocate space for local variables et al.
...
LEAVE     ;restore the stack pointer and pop the old frame pointer
RET       ;return from the function

An entry and exit without a frame pointer could look like:

SUB SP,xx ;allocate space for local variables et al.
...
ADD SP,xx ;de-allocate space for local variables et al.
RET       ;return from the function.

You will save two instructions, but you also duplicate a literal value, so the code doesn't get shorter (quite the opposite). You might have saved a few clock cycles (or not, if it causes a miss in the instruction cache). You did save some space on the stack, though.
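One way to check the two entry/exit sequences on your own compiler is to build a small function both ways and compare the assembly listings. The sketch below is an illustration, not something from the original answer: the function `sum_squares`, the file name `sum.c`, and the exact command lines are assumptions, and the listing you get will vary with compiler version and target. It is written in C89 style so that Visual Studio 2010 accepts it.

```c
#include <assert.h>

/* Hypothetical test function. The volatile local array forces the compiler
   to reserve stack space, so the function needs a real prologue and
   epilogue even at full optimisation. */
int sum_squares(int n)
{
    volatile int buf[16];
    int total = 0;
    int i;

    assert(n >= 0 && n <= 16);
    for (i = 0; i < n; ++i)
        buf[i] = i * i;          /* store into the stack-resident array */
    for (i = 0; i < n; ++i)
        total += buf[i];         /* read it back */
    return total;
}

/* Assumed build commands for a 32-bit x86 target (file name is illustrative):
     cl /c /Ox /FA      sum.c    -> sum.asm, frame pointer omitted
     cl /c /Ox /Oy- /FA sum.c    -> sum.asm, frame pointer kept
     gcc -m32 -O2 -S -fomit-frame-pointer    sum.c
     gcc -m32 -O2 -S -fno-omit-frame-pointer sum.c */
```

With the frame pointer kept, one would expect the listing to open with a push of EBP followed by a copy of ESP into EBP. Without it, the locals should be addressed relative to ESP instead.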

You do free up a general purpose register. This has only benefits.

In regcall/fastcall, this is one extra register where you can store arguments to your function. Thus, if your function takes seven (on x86; more on most other architectures) or more arguments (including this), the seventh argument still fits into a register. The same, more importantly, applies to local variables as well. Arrays and large objects don't fit into registers (but pointers to them do), but if your function is using seven different local variables (including temporary variables needed to calculate complex expressions), chances are the compiler will be able to produce smaller code. Smaller code means a lower instruction cache footprint, which means a reduced miss rate and thus even less memory access (but Intel Atom has a 32K instruction cache, meaning that your code will probably fit anyway).

The x86 architecture features the [BX/BP/SI/DI] and [BX/BP + SI/DI] addressing modes. This makes the BP register an extremely useful place for a scaled array index, especially if the array pointer resides in the SI or DI registers. Two offset registers are better than one.

Utilising a register avoids memory access, but if a variable is worth storing in a register, chances are it will survive just as well in the L1 cache (especially since it's going to be on the stack). There is still the cost of moving to/from the cache, but since modern CPUs do a lot of move optimisation and parallelisation, it is possible that an L1 access would be just as fast as a register access. Thus, the speed benefit from not moving data around is still present, but not as large. I can easily imagine the CPU avoiding the data cache completely, at least as far as reading is concerned (and writing to cache can be done in parallel).

A register that is utilised is a register that needs preserving. It is not worth storing much in the registers if you are going to push it to the stack anyway before you use it again. In preserve-by-caller calling conventions (such as the one above), this means that registers as persistent storage are not as useful in a function that calls other functions a lot.

Also note that x86 has a separate register space for floating point registers, meaning that floats cannot utilise the BP register without extra data movement instructions anyway. Only integers and memory pointers can.

What you do lose by omitting frame pointers is debuggability. This answer shows why:

If the code crashes, all the debugger needs to do to generate the stack trace is:

    PUSH FP      ; log the current frame pointer as well
$1: CALL log_FP  ; log the frame pointer currently on stack
    LEAVE        ; pop the frame pointer to get the next one
    CMP [FP+4],0
    JNZ $1       ; until the stack cannot be popped (the return address is some specific value)

If the code crashes without a frame pointer, the debugger might have no way to generate the stack trace, because it might not know how much needs to be subtracted from the stack pointer (to find out, it needs to locate the function entry/exit point). If the debugger doesn't know the frame pointer is not being used, it might even crash itself.
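The frame-pointer walk described above can be modelled in C as following a linked list. This is a simplified sketch, not a real debugger: the `Frame` struct and `walk_frames` function are hypothetical names, and the model assumes each frame stores the caller's frame pointer at [FP] and the return address just above it, with a null saved pointer marking the outermost frame (the pseudo-assembly above uses a sentinel return address instead).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of what the frame pointer points at in each frame. */
typedef struct Frame {
    const struct Frame *saved_fp;  /* caller's frame; NULL at the outermost frame (assumption) */
    uintptr_t return_address;      /* where execution resumes in the caller */
} Frame;

/* Follow the saved-FP chain starting at fp and copy each return address
   into out. Returns the number of frames recorded (at most max_frames). */
size_t walk_frames(const Frame *fp, uintptr_t *out, size_t max_frames)
{
    size_t n = 0;

    assert(out != NULL);
    while (fp != NULL && n < max_frames) {
        out[n++] = fp->return_address;  /* log this frame */
        fp = fp->saved_fp;              /* step to the caller's frame */
    }
    return n;
}
```

Every step needs only the current frame pointer, which is why the trace is cheap when frame pointers are present. Without them, the walker has to recover each frame's size from somewhere else, such as the debug information in a PDB file.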
