Unable to understand example of cdecl calling convention where caller doesn't need to clean the stack

Question

I am reading the IDA Pro Book. On page 86 while discussing calling conventions, the author shows an example of cdecl calling convention that eliminates the need for the caller to clean arguments off the stack. I am reproducing the code snippet below:

; demo_cdecl(1, 2, 3, 4); //programmer calls demo_cdecl
mov [esp+12], 4 ; move parameter z to fourth position on stack
mov [esp+8], 3 ; move parameter y to third position on stack
mov [esp+4], 2 ; move parameter x to second position on stack
mov [esp], 1 ; move parameter w to top of stack
call demo_cdecl ; call the function

The author goes on to say:

in the above example, the compiler has preallocated storage space for the arguments to demo_cdecl at the top of the stack during the function prologue.

I am going to assume that there is a sub esp, 0x10 at the top of the code snippet. Otherwise, you would just be corrupting the stack.

He later says that the caller doesn't need to adjust the stack when the call to demo_cdecl completes. But surely there has to be an add esp, 0x10 after the call.

What am I missing?

Recommended answer

Compilers often choose mov to store args instead of push, if there's enough space already allocated (e.g. with a sub esp, 0x10 earlier in the function like you suggested).

Here is an example:

int f1(int);
int f2(int,int);

int foo(int a) {
    f1(2);
    f2(3,4);

    return f1(a);
}

Compiled by clang 6.0 with -O3 -march=haswell on Godbolt:

    sub     esp, 12                # reserve space to realign stack by 16
    mov     dword ptr [esp], 2     # store arg
    call    f1(int)
                    # reuse the same arg-passing space for the next function
    mov     dword ptr [esp + 4], 4  
    mov     dword ptr [esp], 3
    call    f2(int, int)
    add     esp, 12
                    # now ESP is pointing to our own arg
    jmp     f1(int)                  # TAILCALL
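
To make the stack bookkeeping in the clang output above explicit, here is a minimal sketch in C that tracks only ESP, as an offset from its value on entry to foo. The Cpu struct and the helper names (sub_esp, add_esp, mov_store, call_cdecl, esp_before_tailcall_clang) are hypothetical, invented for this illustration; the model assumes a cdecl callee that returns with a plain ret and so pops nothing but the return address.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: ESP only, as a signed offset from its value on entry to foo. */
typedef struct { int32_t esp; } Cpu;

static void sub_esp(Cpu *c, int32_t n) { c->esp -= n; }
static void add_esp(Cpu *c, int32_t n) { c->esp += n; }

/* A mov store to [esp+disp] writes memory but never changes ESP. */
static void mov_store(Cpu *c, int32_t disp, int32_t value) {
    (void)c; (void)disp; (void)value;
}

/* call pushes the return address (-4); the cdecl callee's ret pops it (+4).
   The callee does not remove its arguments. */
static void call_cdecl(Cpu *c) {
    c->esp -= 4;
    c->esp += 4;
}

/* Replays the clang sequence and returns ESP (relative to entry) at the jmp. */
int32_t esp_before_tailcall_clang(void) {
    Cpu c = { 0 };
    sub_esp(&c, 12);          /* sub esp, 12: one reservation for all calls */
    mov_store(&c, 0, 2);      /* mov dword ptr [esp], 2 */
    call_cdecl(&c);           /* call f1 */
    assert(c.esp == -12);     /* nothing to clean up: the slot is simply reused */
    mov_store(&c, 4, 4);      /* mov dword ptr [esp + 4], 4 */
    mov_store(&c, 0, 3);      /* mov dword ptr [esp], 3 */
    call_cdecl(&c);           /* call f2 */
    assert(c.esp == -12);
    add_esp(&c, 12);          /* add esp, 12: the single matching release */
    return c.esp;
}
```

In this model esp_before_tailcall_clang() returns 0: no per-call adjustment is needed, but the one sub esp at the top is still paired with one add esp before the function leaves, which is the adjustment the question expected to see somewhere.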

clang's code-gen would have been even better with sub esp,8 / push 2, leaving the rest of the function unchanged, i.e. let push grow the stack, because it has smaller code size than mov (especially mov-immediate) and performance is no worse (because we're about to call, which also uses the stack engine). See "What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once?" for more details.

I also included in the Godbolt link GCC output with and without -maccumulate-outgoing-args, which defers clearing the stack until the end of the function.

By default (without accumulate outgoing args) gcc does let ESP bounce around, and even uses 2x pop to clear 2 args from the stack. (Avoiding a stack-sync uop, at the cost of 2 useless loads that hit in L1d cache). With 3 or more args to clear, gcc uses add esp, 4*N. I suspect that reusing the arg-passing space with mov stores instead of add esp / push would be a win sometimes for overall performance, especially with registers instead of immediates. (push imm8 is much more compact than mov imm32.)
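
As a rough illustration of the code-size point, the sketch below spells out the 32-bit machine-code bytes for the instructions involved and adds up their lengths. The array and function names are hypothetical, chosen for this example; the byte sequences are the standard encodings for these specific operands (an assembler could pick a different but equivalent form in other cases).

```c
#include <stddef.h>

/* Standard 32-bit encodings, written out so sizeof gives the instruction length. */
const unsigned char push_imm8[]      = { 0x6A, 0x02 };                      /* push 2 */
const unsigned char mov_esp_imm32[]  = { 0xC7, 0x04, 0x24, 0x03, 0, 0, 0 }; /* mov dword ptr [esp], 3 */
const unsigned char mov_esp4_imm32[] = { 0xC7, 0x44, 0x24, 0x04,
                                         0x04, 0, 0, 0 };                   /* mov dword ptr [esp+4], 4 */
const unsigned char pop_eax[]        = { 0x58 };                            /* pop eax */
const unsigned char pop_edx[]        = { 0x5A };                            /* pop edx */
const unsigned char add_esp_imm8[]   = { 0x83, 0xC4, 0x08 };                /* add esp, 8 */

/* Bytes needed to pass two small immediate args with push. */
size_t bytes_two_args_push(void) { return 2 * sizeof push_imm8; }

/* Bytes needed to pass the same two args with mov stores into reserved space. */
size_t bytes_two_args_mov(void) { return sizeof mov_esp_imm32 + sizeof mov_esp4_imm32; }

/* Bytes needed to discard two args afterwards: 2x pop versus add esp, 8. */
size_t bytes_cleanup_two_pops(void) { return sizeof pop_eax + sizeof pop_edx; }
size_t bytes_cleanup_add_esp(void)  { return sizeof add_esp_imm8; }
```

With these encodings, two pushes take 4 bytes against 15 bytes for the two mov stores, and two pops take 2 bytes against 3 for add esp, 8. The gap narrows when the args are in registers, since a mov store from a register carries no 4-byte immediate.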

foo(int):            # gcc7.3 -O3 -m32   output
    push    ebx
    sub     esp, 20
    mov     ebx, DWORD PTR [esp+28]    # load the arg even though we never need it in a register
    push    2                          # first function arg
    call    f1(int)
    pop     eax
    pop     edx                        # clear the stack
    push    4
    push    3                          # and write the next two args
    call    f2(int, int)
    mov     DWORD PTR [esp+32], ebx    # store `a` back where it already was
    add     esp, 24
    pop     ebx
    jmp     f1(int)                    # and tailcall

With -maccumulate-outgoing-args, the output is basically like clang's, but gcc still saves/restores ebx and keeps a in it before doing the tailcall.

Note that having ESP bounce around requires extra metadata in .eh_frame for stack unwinding. Jan Hubicka writes in 2014:

There are still pros and cons of arg accumulation. I did quite extensive testing on AMD chips and found it performance neutral. On 32bit code it saves about 4% of code but with frame pointer disabled it expands unwind info quite a lot, so resulting binary is about 8% bigger. (This is also current default for -Os)

So using push for args, and at least typically clearing them off the stack after each call, saves about 4% in code size (in bytes; this matters for L1i cache footprint). I think there's a happy medium here, where gcc could use push more often without using only push/pop.

There's a confounding effect of maintaining 16-byte stack alignment before call, which is required by the current version of the i386 System V ABI. In 32-bit mode, it used to just be a gcc default to maintain -mpreferred-stack-boundary=4. (i.e. 1<<4). I think you can still use -mpreferred-stack-boundary=2 to violate the ABI and make code that only cares about 4B alignment for ESP.
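
The alignment requirement explains the odd-looking constants in both listings. The following sketch replays each instruction sequence and checks that ESP is a multiple of 16 at every call. It assumes, per the ABI rule just mentioned, that foo's own caller had ESP 16-byte aligned when it executed its call, so ESP on entry sits 4 bytes below an aligned address; the function names are hypothetical.

```c
#include <stdbool.h>

/* esp is ESP minus some 16-byte-aligned address, so ESP is aligned iff esp % 16 == 0. */
static bool aligned16(int esp) { return esp % 16 == 0; }

/* gcc7.3 -O3 -m32 sequence (push/pop style). */
bool gcc_sequence_keeps_alignment(void) {
    int esp = -4;                    /* entry: aligned at the call, then return address pushed */
    bool ok = true;
    esp -= 4;                        /* push ebx */
    esp -= 20;                       /* sub esp, 20 */
    esp -= 4;                        /* push 2 */
    ok = ok && aligned16(esp);       /* call f1 */
    esp += 8;                        /* pop eax; pop edx */
    esp -= 8;                        /* push 4; push 3 */
    ok = ok && aligned16(esp);       /* call f2 */
    esp += 24;                       /* add esp, 24 */
    esp += 4;                        /* pop ebx */
    return ok && esp == -4;          /* jmp f1 with ESP exactly as on entry */
}

/* clang6.0 sequence (mov stores into preallocated space). */
bool clang_sequence_keeps_alignment(void) {
    int esp = -4;                    /* same entry assumption */
    bool ok = true;
    esp -= 12;                       /* sub esp, 12 */
    ok = ok && aligned16(esp);       /* call f1: the mov stores did not move ESP */
    ok = ok && aligned16(esp);       /* call f2 */
    esp += 12;                       /* add esp, 12 */
    return ok && esp == -4;          /* jmp f1 with ESP exactly as on entry */
}
```

Both functions return true under that entry assumption: sub esp, 12 and sub esp, 20 each reserve more than the args need, and the surplus is padding that brings ESP to a 16-byte boundary at the call sites.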

I didn't try this on Godbolt, but you could.