Why does GCC on x86-64 insert a NOP inside of a function?

Question

Given the following C function:

void go(char *data) {
    char name[64];
    strcpy(name, data);
}

GCC 5 and 6 on x86-64 compile this (plain gcc -c -g -o, followed by objdump) to:

0000000000000000 <go>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   48 83 ec 50             sub    $0x50,%rsp
   8:   48 89 7d b8             mov    %rdi,-0x48(%rbp)
   c:   48 8b 55 b8             mov    -0x48(%rbp),%rdx
  10:   48 8d 45 c0             lea    -0x40(%rbp),%rax
  14:   48 89 d6                mov    %rdx,%rsi
  17:   48 89 c7                mov    %rax,%rdi
  1a:   e8 00 00 00 00          callq  1f <go+0x1f>
  1f:   90                      nop
  20:   c9                      leaveq 
  21:   c3                      retq   

Is there any reason for GCC to insert the 90/nop at 1f, or is it just a side effect that can appear when no optimizations are turned on?

Note: This question differs from most others because it asks about a nop inside a function body, not about padding outside it.

Compiler versions tested: GCC Debian 5.3.1-14 (5.3.1) and Debian 6-20160313-1 (6.0.0)

Recommended answer

That's weird; I'd never noticed stray nops in the asm output at -O0 before. (Probably because I don't waste my time looking at un-optimized compiler output.)

Usually nops inside functions are there to align branch targets, including function entry points, as in the question Brian linked. (Also see -falign-loops in the gcc docs, which is enabled at -O2 and -O3.)

In this case, the nop is part of the compiler noise for a bare empty function:

void go(void) {
    //char name[64];
    //strcpy(name, data);
}

    push    rbp
    mov     rbp, rsp
    nop                     # only present for gcc5, not gcc 4.9.3
    pop     rbp
    ret

See that code in the Godbolt Compiler Explorer, where you can check the asm for other compiler versions and compile options.

(The frame setup is not technically noise: -O0 enables -fno-omit-frame-pointer, and at -O0 even empty functions set up and tear down a stack frame.)

Of course, that nop is not present at any non-zero optimization level. The nop in the question's code has no debugging or performance advantage. (See the performance guide links in the x86 tag wiki, especially Agner Fog's microarchitecture guide, to learn what makes code fast on current CPUs.)

My guess is that it's purely an artifact of gcc internals. This nop is present as a literal nop in the gcc -S asm output, not as a .p2align directive. gcc itself doesn't count machine-code bytes; it just emits alignment directives at certain points to align important branch targets. Only the assembler knows how large a nop is actually needed to reach a given alignment.
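To make the assembler's side of that concrete, here is a minimal sketch of the arithmetic behind an alignment directive such as .p2align: given the current offset and a power-of-two alignment, how many padding bytes are needed. The function name p2align_padding is made up for this illustration, and this is not how any particular assembler implements it.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of padding needed to move `offset` up to the next multiple of
 * 2^pow2. This is the quantity an assembler has to work out when it
 * meets an alignment directive. Hypothetical helper for illustration. */
size_t p2align_padding(size_t offset, unsigned pow2)
{
    assert(pow2 < sizeof(size_t) * 8);   /* keep the shift well-defined */
    size_t align = (size_t)1 << pow2;
    /* Distance to the next boundary; the final mask turns a full
     * `align` (already aligned) into 0. */
    return (align - (offset & (align - 1))) & (align - 1);
}
```

For example, p2align_padding(0x1f, 4) is 1 and p2align_padding(0x20, 4) is 0. The compiler cannot do this calculation itself because it does not know the offsets, which is why it emits a directive rather than a fixed number of nops when it wants alignment. The nop in the question is a literal instruction in the compiler's output, so it is not the result of a calculation like this.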

The default -O0 tells gcc that you want it to compile fast, not to make good code. As a result, the asm output at -O0 tells you more about gcc internals than the output at other -O levels does, and very little about how to optimize or anything else.

If you're trying to learn asm, it's more interesting to look at the code at -Og (optimize for debugging), for example.

If you're trying to see how well gcc or clang do at making code, you should look at -O3 -march=native (or -O2 -mtune=intel, or whatever settings you build your project with). Puzzling out the optimizations made at -O3 is a good way to learn some asm tricks, though. -fno-tree-vectorize is handy if you want to see a non-vectorized version of something that is otherwise fully optimized.
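If you want something to compare across those levels, a small test function with an observable result works better than the original go(), whose copy an optimizer may simply discard. The following is only a sketch: go_len is a hypothetical variant of the question's function, not code from the question, and it uses a bounded copy so that long inputs cannot overflow the buffer.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical variant of go() from the question. It returns the length
 * of the copied string so the work has a visible result at every -O
 * level, and it bounds the copy to the size of name[]. */
size_t go_len(const char *data)
{
    char name[64];

    assert(data != NULL);                   /* precondition of this sketch */
    strncpy(name, data, sizeof name - 1);   /* copy at most 63 bytes */
    name[sizeof name - 1] = '\0';           /* strncpy may not terminate */
    return strlen(name);
}
```

Compiling this file with gcc -O0 -S, gcc -Og -S, and gcc -O3 -S and reading the three .s files side by side shows how much of the -O0 output is frame setup and stack traffic rather than anything the source asked for.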
