GCC Assembly Optimizations - Why are these equivalent?


Problem description

I am trying to learn how assembly works at an elementary level and so I have been playing with the -S output of gcc compilations. I wrote a simple program that defines two bytes and returns their sum. The entire program follows:

int main(void) {
  char A = 5;
  char B = 10;
  return A + B;
}

When I compile this with no optimizations using:

gcc -O0 -S -c test.c

I get test.s that looks like the following:

    .file   "test.c"
    .def    ___main;    .scl    2;  .type   32; .endef
    .text
    .globl  _main
    .def    _main;  .scl    2;  .type   32; .endef
_main:
LFB0:
    .cfi_startproc
    pushl   %ebp
    .cfi_def_cfa_offset 8
    .cfi_offset 5, -8
    movl    %esp, %ebp
    .cfi_def_cfa_register 5
    andl    $-16, %esp
    subl    $16, %esp
    call    ___main
    movb    $5, 15(%esp)
    movb    $10, 14(%esp)
    movsbl  15(%esp), %edx
    movsbl  14(%esp), %eax
    addl    %edx, %eax
    leave
    .cfi_restore 5
    .cfi_def_cfa 4, 4
    ret
    .cfi_endproc
LFE0:
    .ident  "GCC: (GNU) 4.9.2"

Now, recognizing that this program can very easily be simplified to just return a constant (15), I have been able to reduce the assembly by hand to perform the same function using this code:

.global _main
_main:
    movl    $15, %eax
    ret

This appears to me to be the least amount of code possible (but I realize I could be quite wrong) to perform this admittedly trivial task. Is this form the most "optimized" version of my C program?

Why is the initial output of GCC so much more verbose? What do the lines spanning from .cfi_startproc to call ___main even do? What does call ___main do? I cannot figure out what the two subtraction operations are for.

Even with optimizations in GCC set to -O3 I get this:

    .file   "test.c"
    .def    ___main;    .scl    2;  .type   32; .endef
    .section    .text.unlikely,"x"
LCOLDB0:
    .section    .text.startup,"x"
LHOTB0:
    .p2align 4,,15
    .globl  _main
    .def    _main;  .scl    2;  .type   32; .endef
_main:
LFB0:
    .cfi_startproc
    pushl   %ebp
    .cfi_def_cfa_offset 8
    .cfi_offset 5, -8
    movl    %esp, %ebp
    .cfi_def_cfa_register 5
    andl    $-16, %esp
    call    ___main
    movl    $15, %eax
    leave
    .cfi_restore 5
    .cfi_def_cfa 4, 4
    ret
    .cfi_endproc
LFE0:
    .section    .text.unlikely,"x"
LCOLDE0:
    .section    .text.startup,"x"
LHOTE0:
    .ident  "GCC: (GNU) 4.9.2"

This seems to have removed a number of operations, but it still leaves all the lines leading to call ___main that seem unnecessary. What are all the .cfi_XXX lines for? Why are so many labels added? What do .section, .ident, .def, .p2align, etc. do?

I understand that many of the labels and symbols are included for debugging, but shouldn't these be stripped or omitted if I am not compiling with -g enabled?


UPDATE

To clarify, by saying

This appears to me to be the least amount of code possible (but I realize I could be quite wrong) to perform this admittedly trivial task. Is this form the most "optimized" version of my C program?

I am not suggesting that I am trying to, or have achieved, an optimized version of this program. I realize the program is useless and trivial. I am just using it as a tool to learn assembly and how the compiler works.

The core of why I added this bit is to illustrate why I am confused that the 4 line version of this assembly code can effectively achieve the same effect as the others. It seems to me that GCC has added a lot of "stuff" whose purpose I cannot discern.

Solution

Thank you, Kin3TiX, for asking an asm-newbie question that wasn't just a code-dump of some nasty code with no comments, and a really simple problem. :)

As a way to get your feet wet with asm, I'd suggest working with functions OTHER than main. For example, just a function that takes two integer args and adds them. Then the compiler can't optimize it away. You can still call it with constants as args, and if it's in a different file from main, it won't get inlined, so you can even single-step through it.
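A minimal sketch of that suggestion, assuming a hypothetical file add.c kept separate from the file that contains main (the names add and add.c are illustrative and do not come from the question):

```c
/* add.c - hypothetical file, compiled separately from the file holding main().
 * The arguments are unknown when this file is compiled, so the compiler
 * cannot fold the addition into a constant the way it did for A + B in main. */
int add(int a, int b)
{
    return a + b;
}
```

Compiling just this file with gcc -O3 -S add.c should leave a short function body to read, and a caller in another file can still pass constants, e.g. add(5, 10).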

There's some benefit to understanding what's going on at the asm level when you compile main, but other than embedded systems, you're only ever going to write optimized inner loops in asm. IMO, there's little point using asm if you aren't going to optimize the hell out of it. Otherwise you probably won't beat compiler output from source which is much easier to read.

Other tips for understanding compiler output: compile with

    gcc -S -fno-stack-check -fverbose-asm

The comments after each instruction are often nice reminders of what that load was for. Pretty soon it degenerates into a mess of temporaries with names like D.2983, but something like

    movq 8(%rdi), %rcx # a_1(D)->elements, a_1(D)->elements

will save you a round-trip to the ABI reference to see which function arg comes in %rdi, and which struct member is at offset 8.

What do the lines spanning from .cfi_startproc to call ___main even do?

_main:
LFB0:
    .cfi_startproc
    pushl   %ebp
    .cfi_def_cfa_offset 8
    .cfi_offset 5, -8
    movl    %esp, %ebp
    .cfi_def_cfa_register 5

As others have said, the .cfi stuff gets lumped in with debugging info. More precisely, the .cfi directives describe call-frame information: metadata that lets a debugger or exception handler unwind the stack. That is why they show up in the -S output even without -g. They are not instructions and add nothing to the function's code; on many targets gcc stops emitting them if you pass -fno-asynchronous-unwind-tables. Often I look at asm from objdump -d output instead of gcc -S, usually because I can benchmark the executable and look at its asm without needing to invoke gcc multiple times.

The stuff with pushing %ebp and then setting it to the value of the stack pointer on function entry sets up what's called a "stack frame". This is why %ebp is called the base pointer. These insns won't be there if you compile with -fomit-frame-pointer, which gives code an extra register to work with. (This is huge for 32-bit x86, since that takes you from 6 to 7 regs. %esp is still tied up being the stack pointer; stashing it temporarily in an xmm or mmx reg and then using it as another GP reg is possible, but your code will be hard to debug!)

The leave instruction before the ret is also part of this stack frame stuff.

I'm not entirely clear on the purpose of frame pointers. With debug symbols, you can backtrace the call stack just fine even with -fomit-frame-pointer, and it's the default for amd64. (The amd64 ABI has alignment requirements for the stack, and is a LOT better in other ways, too, e.g. it passes args in regs instead of on the stack.)

    andl    $-16, %esp
    subl    $16, %esp

The andl aligns the stack to a 16-byte boundary, regardless of what it was before. The subl reserves 16 bytes on the stack for this function. (Notice how the subl is missing from the optimized version, because the optimizer removes any need for memory storage of any variables.)
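The same arithmetic can be sketched in C. This is only an illustration of the bit trick, not what gcc emits; the helper names align_down16 and reserve16 are made up for the example:

```c
#include <stdint.h>

/* Round an address down to a multiple of 16, the same bit trick as
 * "andl $-16, %esp": -16 in two's complement is ...11110000, so the
 * AND clears the low four bits. */
uintptr_t align_down16(uintptr_t addr)
{
    return addr & ~(uintptr_t)15;
}

/* Mimic "subl $16, %esp": the stack grows downward on x86, so
 * reserving space means subtracting from the stack pointer. */
uintptr_t reserve16(uintptr_t sp)
{
    return sp - 16;
}
```

For instance, a stack pointer of 0x1237 would become 0x1230 after the alignment step and 0x1220 after the reservation.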

    call    ___main

__main (asm name = ___main) is probably a gcc run-time library function that calls constructors for things that need it. Maybe library setup stuff, and it might be where constructors for any of your own global / static variables are called from. (This old mailing list message indicates __main is for constructors, but that main shouldn't have to call it on platforms that support getting the startup code to call it. Maybe i386 doesn't have that, only amd64?) Edit: you said in a comment that this came from cygwin. That would explain it, since cygwin has to make non-ELF .exes.

    movb    $5, 15(%esp)
    movb    $10, 14(%esp)
    movsbl  15(%esp), %edx
    movsbl  14(%esp), %eax
    addl    %edx, %eax
    leave
    ret

Why is the initial output of GCC so much more verbose?

Without optimizations enabled, gcc maps C statements as literally as possible into asm. Doing anything else would take more compile time. Thus, the two movb instructions come from the initializers for your two variables. The return value is computed by doing two loads (with sign extension via movsbl, because we need to convert to int BEFORE the add, to match the semantics of the C code as written, as far as overflow is concerned).
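A small sketch of why the widening has to happen before the add, using signed char explicitly so the example does not depend on whether plain char is signed on a given platform (sum_bytes is an illustrative name):

```c
/* Both operands are promoted to int before the addition, which is what
 * the two movsbl (sign-extending byte loads) implement in the -O0 output. */
int sum_bytes(signed char a, signed char b)
{
    /* Computed in int: 100 + 100 gives 200 even though 200 does not
     * fit in a signed char, and negative bytes keep their sign. */
    return a + b;
}
```

If the add were done in 8 bits instead, a sum like 100 + 100 could not be represented, which is the overflow difference the sign-extending loads avoid.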

I cannot figure out what the two subtraction operations are for.

There is only one sub instruction. It reserves space on the stack for the function's variables, before the call to __main. Which other sub are you talking about?

What do .section, .ident, .def, .p2align, etc. do?

See the manual for the GNU assembler. Also available locally as info pages: run info gas.

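.ident and .def: .ident looks like gcc putting its stamp on the object file, so you can tell what compiler produced it. .def ... .endef records COFF symbol-table information for the named symbol (here .scl 2 marks it as external and .type 32 marks it as a function); it appears because this is Cygwin output rather than ELF. Neither affects the generated code, so ignore these.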

.section: determines what section of the object file (ELF on Linux, COFF/PE for this cygwin output) the bytes from all following instructions or data directives (e.g. .byte 0x00) go into, until the next .section assembler directive. Usually that is code (read-only, shareable), data (initialized read/write data, private), or bss (block storage segment: zero-initialized, doesn't take any space in the object file).

.p2align: Power of 2 Align. Pad with nop instructions until the desired alignment is reached. On x86, .align 16 is the same as .p2align 4. In .p2align 4,,15 the empty middle field is the fill value (left at its default) and 15 is the maximum number of padding bytes to insert. Jump instructions are faster when the target is aligned, because of instruction fetch in chunks of 16B, not crossing a page boundary, or just not crossing a cache-line boundary. (32B alignment is relevant when code is already in the uop cache of an Intel Sandybridge and later.) See Agner Fog's docs, for example.
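To make the padding rule concrete, here is a rough C model of how many bytes a directive of the form .p2align pow2,,max_skip would insert at a given offset. The function p2align_padding is a hypothetical helper written for this explanation, not part of any toolchain:

```c
#include <stddef.h>

/* Number of padding bytes ".p2align pow2,,max_skip" would insert at
 * offset pos: pad up to the next multiple of 2^pow2, unless that would
 * take more than max_skip bytes, in which case no padding is added. */
size_t p2align_padding(size_t pos, unsigned pow2, size_t max_skip)
{
    size_t align = (size_t)1 << pow2;
    /* Distance to the next boundary, or 0 if pos is already aligned. */
    size_t pad = (align - (pos & (align - 1))) & (align - 1);
    return pad <= max_skip ? pad : 0;
}
```

With pow2 = 4 and max_skip = 15, as in the -O3 output, any misaligned offset gets padded, since at most 15 bytes are ever needed to reach a 16-byte boundary.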

The core of why I added this bit is to illustrate why I am confused that the 4 line version of this assembly code can effectively achieve the same effect as the others. It seems to me that GCC has added a lot of "stuff" whose purpose I cannot discern.

Put the code of interest in a function by itself. A lot of things are special about main.

You are correct that a mov-immediate and a ret are all that's needed to implement the function, but gcc apparently doesn't have shortcuts for recognizing trivial whole-programs and omitting main's stack frame or the call to __main. >.<

Good question, though. As I said, just ignore all that crap and worry about just the small part you want to optimize.
