Why do compilers insist on using a callee-saved register here?


Question

Consider this C code:

void foo(void);

long bar(long x) {
    foo();
    return x;
}

When I compile it on GCC 9.3 with -O3 or -Os, I get this:

bar:
        push    r12
        mov     r12, rdi
        call    foo
        mov     rax, r12
        pop     r12
        ret

The output from clang is the same, except it chooses rbx instead of r12 as the callee-saved register.

However, I'd expect/hope to see assembly that looks more like this:

bar:
        push    rdi
        call    foo
        pop     rax
        ret

In English, here's what I see happening:

  • Push the old value of a callee-saved register onto the stack
  • Move x into that callee-saved register
  • Call foo
  • Move x from the callee-saved register into the return-value register
  • Pop the stack to restore the old value of the callee-saved register

Why bother with a callee-saved register at all? Why not do this instead? It seems shorter, simpler, and possibly even faster:

  • Push x onto the stack
  • Call foo
  • Pop x off the stack into the return-value register

Is my assembly wrong? Is it somehow less efficient than tying up an extra register? If the answer to both of those is "no", why don't GCC or clang do it this way?

Godbolt link.


Edit: Here's a less trivial example, to show it happens even if the variable is meaningfully used:

long foo(long);

long bar(long x) {
    return foo(x * x) - x;
}

I get this:

bar:
        push    rbx
        mov     rbx, rdi
        imul    rdi, rdi
        call    foo
        sub     rax, rbx
        pop     rbx
        ret

I'd rather have this:

bar:
        push    rdi
        imul    rdi, rdi
        call    foo
        pop     rdi
        sub     rax, rdi
        ret

This time, it's only one instruction off vs. two, but the core concept is the same.

Godbolt link.

Solution

TL:DR:

  • Compiler internals are probably not set up to look for this optimization easily, and it's probably only useful around small functions, not inside large functions between calls.
  • Inlining to create large functions is a better solution most of the time
  • There can be a latency vs. throughput tradeoff if foo happens not to save/restore RBX.

Compilers are complex pieces of machinery. They're not "smart" like a human, and expensive algorithms to find every possible optimization are often not worth the cost in extra compile time.

I reported this as GCC bug 69986 - smaller code possible with -Os by using push/pop to spill/reload back in 2016; there's been no activity or replies from GCC devs. :/

Slightly related: GCC bug 70408 - reusing the same call-preserved register would give smaller code in some cases - compiler devs told me it would take a huge amount of work for GCC to be able to do that optimization because it requires picking order of evaluation of two foo(int) calls based on what would make the target asm simpler.
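
A rough hand-written sketch (not taken from the bug report) of what reusing one call-preserved register means for something like long bar2(long x, long y) { return foo(x) + foo(y); }, with the evaluation order picked to keep the asm simple:

bar2:
        push    rbx
        mov     rbx, rsi        # keep y alive across the first call
        call    foo             # foo(x): x is already in RDI
        mov     rdi, rbx
        mov     rbx, rax        # reuse RBX to keep foo(x)'s result alive across the second call
        call    foo             # foo(y)
        add     rax, rbx        # foo(y) + foo(x)
        pop     rbx
        ret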


If foo doesn't save/restore rbx itself, there's a tradeoff between throughput (instruction count) and extra store/reload latency on the x -> retval dependency chain.

Compilers usually favour latency over throughput, e.g. using 2x LEA instead of imul reg, reg, 10 (3-cycle latency, 1/clock throughput), because most code averages significantly less than 4 uops / clock on typical 4-wide pipelines like Skylake. (More instructions/uops do take more space in the ROB, reducing how far ahead the same out-of-order window can see, though, and execution is actually bursty with stalls probably accounting for some of the less-than-4 uops/clock average.)
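
For reference, that multiply-by-10 choice looks roughly like this (hand-written sketch, not actual compiler output):

# one instruction (1 uop), but 3 cycles of latency
        imul    rax, rdi, 10
# two instructions (2 uops), but only 2 cycles of latency on the x -> result chain
        lea     rax, [rdi + rdi*4]      # x*5
        lea     rax, [rax + rax]        # x*10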

If foo does push/pop RBX, then there's not much to gain for latency. Having the restore happen just before the ret instead of just after is probably not relevant, unless there's a ret mispredict or I-cache miss that delays fetching code at the return address.

Most non-trivial functions will save/restore RBX, so it's often not a good assumption that leaving a variable in RBX will actually mean it truly stayed in a register across the call. (Although randomizing which call-preserved registers functions choose might be a good idea to mitigate this sometimes.)


So yes, push rdi / pop rax would be more efficient in this case, and this is probably a missed optimization for tiny non-leaf functions, depending on what foo does and the balance between extra store/reload latency for x vs. more instructions to save/restore the caller's rbx.

It is possible for stack-unwind metadata to represent the changes to RSP here, just like if it had used sub rsp, 8 to spill/reload x into a stack slot. (But compilers don't know this optimization either, of using push to reserve space and initialize a variable. What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once?. And doing that for more than one local var would lead to larger .eh_frame stack unwind metadata because you're moving the stack pointer separately with each push. That doesn't stop compilers from using push/pop to save/restore call-preserved regs, though.)
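
For illustration (a hand-written sketch, not what compilers currently emit), reserving and initializing two stack slots for a hypothetical function's incoming args could go either way:

# conventional: reserve the space once, then store into it
        sub     rsp, 16
        mov     QWORD PTR [rsp], rdi
        mov     QWORD PTR [rsp+8], rsi

# push-based: reserve and initialize with one instruction each (same final layout),
# but RSP moves twice, so the .eh_frame unwind info needs an entry per adjustment
        push    rsi
        push    rdi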


IDK if it would be worth teaching compilers to look for this optimization

It's maybe a good idea around a whole function, not across one call inside a function. And as I said, it's based on the pessimistic assumption that foo will save/restore RBX anyway. (Or optimizing for throughput if you know that latency from x to return value isn't important. But compilers don't know that and usually optimize for latency).

If you start making that pessimistic assumption in lots of code (like around single function calls inside functions), you'll start getting more cases where RBX isn't saved/restored and you could have taken advantage.

You also don't want this extra save/restore push/pop in a loop, just save/restore RBX outside the loop and use call-preserved registers in loops that make function calls. Even without loops, in the general case most functions make multiple function calls. This optimization idea could apply if you really don't use x between any of the calls, just before the first and after the last, otherwise you have a problem of maintaining 16-byte stack alignment for each call if you're doing one pop after a call, before another call.
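
Concretely, the alignment problem looks like this (hand-written sketch of a hypothetical bar2 that needs x again after its first call):

bar2:
        push    rdi             # spill x; this also gives the 16-byte alignment the call needs
        call    foo
        pop     rdi             # reload x, but now RSP is no longer 16-byte aligned
        call    foo             # this second call would be made with a misaligned RSP,
                                # so the compiler would have to adjust RSP again anyway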

Compilers are not great at tiny functions in general. But it's not great for CPUs either. Non-inline function calls have an impact on optimization at the best of times, unless compilers can see the internals of the callee and make more assumptions than usual. A non-inline function call is an implicit memory barrier: a caller has to assume that a function might read or write any globally-accessible data, so all such vars have to be in sync with the C abstract machine. (Escape analysis allows keeping locals in registers across calls if their address hasn't escaped the function.) Also, the compiler has to assume that the call-clobbered registers are all clobbered. This sucks for floating point in x86-64 System V, which has no call-preserved XMM registers.
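
As a hand-written sketch with a hypothetical global named counter: around an opaque call, globals have to be kept in sync through memory, while a local whose address never escapes can stay in a call-preserved register with no store/reload:

        mov     QWORD PTR counter[rip], rax   # store before the call: foo might read the global
        call    foo
        mov     rax, QWORD PTR counter[rip]   # reload after: foo might have written it
                                              # an unescaped local, by contrast, could just sit in RBX across the call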

Tiny functions like bar() are better off inlining into their callers. Compile with -flto so this can happen even across file boundaries in most cases. (Function pointers and shared-library boundaries can defeat this.)
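
For example, if foo's body were visible and trivial enough to inline (say, hypothetically, it just returned its argument plus one), the second bar above could compile to straight-line code with no call, no push/pop, and no callee-saved register at all, something like:

bar:
        mov     rax, rdi
        imul    rax, rdi        # x*x
        sub     rax, rdi        # x*x - x
        add     rax, 1          # foo(x*x) - x == (x*x + 1) - x, with the hypothetical foo(y) = y + 1
        ret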


I think one reason compilers haven't bothered to try to do these optimizations is that it would require a whole bunch of different code in the compiler internals, different from the normal stack vs. register-allocation code that knows how to save call-preserved registers and use them.

i.e. it would be a lot of work to implement, and a lot of code to maintain, and if it gets over-enthusiastic about doing this it could make worse code.

And also that it's (hopefully) not significant; if it matters, you should be inlining bar into its caller, or inlining foo into bar. This is fine unless there are a lot of different bar-like functions and foo is large, and for some reason they can't inline into their callers.
