What's up with gcc's weird stack manipulation when it wants extra stack alignment?

Question

I've seen this r10 weirdness a few times, so let's see if anyone knows what's up.

Take the following simple function:

#include <stdint.h>

#define SZ 4

void sink(uint64_t *p);

void andpop(const uint64_t* a) {
    uint64_t result[SZ];
    for (unsigned i = 0; i < SZ; i++) {
        result[i] = a[i] + 1;
    }

    sink(result);
}

It just adds 1 to each of the 4 64-bit elements of the passed-in array and stores it in a local and calls sink() on the result (to avoid the whole function being optimized away).

Here's the corresponding assembly:

andpop(unsigned long const*):
        lea     r10, [rsp+8]
        and     rsp, -32
        push    QWORD PTR [r10-8]
        push    rbp
        mov     rbp, rsp
        push    r10
        sub     rsp, 40
        vmovdqa ymm0, YMMWORD PTR .LC0[rip]
        vpaddq  ymm0, ymm0, YMMWORD PTR [rdi]
        lea     rdi, [rbp-48]
        vmovdqa YMMWORD PTR [rbp-48], ymm0
        vzeroupper
        call    sink(unsigned long*)
        add     rsp, 40
        pop     r10
        pop     rbp
        lea     rsp, [r10-8]
        ret

It's hard to understand almost everything that is going on with r10. First, r10 is set to point to rsp + 8, then push QWORD PTR [r10-8], which as far as I can tell pushes a copy of the return address on the stack. Following that, rbp is set up as normal and then finally r10 itself is pushed.

To unwind all this, r10 is popped off of the stack and used to restore rsp to its original value.
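The pointer arithmetic in that listing can be traced in a few lines of C. This is only a sketch of the bookkeeping, not of what GCC does internally: `frame_state`, `gcc_prologue` and `gcc_epilogue` are hypothetical names, memory contents are not modelled, and the entry condition (`rsp % 16 == 8`) is the usual x86-64 System V state right after a `call`.

```c
#include <assert.h>
#include <stdint.h>

/* Register values tracked while stepping through the listing. */
struct frame_state {
    uint64_t rsp;   /* stack pointer */
    uint64_t rbp;   /* frame pointer */
    uint64_t r10;   /* points just above the return address at entry */
};

/* Models the prologue; returns the state at the `call sink`. */
struct frame_state gcc_prologue(uint64_t entry_rsp)
{
    /* Assumption: the caller kept the stack 16-byte aligned before `call`,
       so the pushed return address leaves rsp at 8 modulo 16. */
    assert(entry_rsp % 16 == 8);

    struct frame_state s = { entry_rsp, 0, 0 };
    s.r10 = s.rsp + 8;          /* lea  r10, [rsp+8]       */
    s.rsp &= ~(uint64_t)31;     /* and  rsp, -32           */
    s.rsp -= 8;                 /* push QWORD PTR [r10-8]  */
    s.rsp -= 8;                 /* push rbp                */
    s.rbp = s.rsp;              /* mov  rbp, rsp           */
    s.rsp -= 8;                 /* push r10                */
    s.rsp -= 40;                /* sub  rsp, 40            */
    return s;
}

/* Models the epilogue; returns the rsp that `ret` will see. */
uint64_t gcc_epilogue(struct frame_state s)
{
    s.rsp += 40;                /* add  rsp, 40            */
    s.rsp += 8;                 /* pop  r10 (same value in this model) */
    s.rsp += 8;                 /* pop  rbp                */
    s.rsp = s.r10 - 8;          /* lea  rsp, [r10-8]       */
    return s.rsp;
}
```

Whether the entry `rsp` is 8 or 24 modulo 32, `rbp - 48` (the address of `result`) comes out 32-byte aligned, and `gcc_epilogue` returns the entry value, because the final `lea` depends only on `r10` and not on how many bytes the `and` discarded.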

Some observations:

  • Looking at the entire function, all of this seems like a totally roundabout way of simply restoring rsp to its original value before ret - but the usual epilog of mov rsp, rbp would do just as well (see clang)!
  • That said, the (expensive) push QWORD PTR [r10-8] doesn't even help in that mission: this value (the return address?) is apparently never used.
  • Why is r10 pushed and popped at all? The value isn't clobbered in the very small function body and there is no register pressure.

What's up with that? I've seen it several times before, and it usually wants to use r10, sometimes r13. It seems likely that has something to do with aligning the stack to 32 bytes, since if you change SZ to be less than 4 it uses xmm ops and the issue disappears.

For example, with SZ == 2:

andpop(unsigned long const*):
        sub     rsp, 24
        vmovdqa xmm0, XMMWORD PTR .LC0[rip]
        vpaddq  xmm0, xmm0, XMMWORD PTR [rdi]
        mov     rdi, rsp
        vmovaps XMMWORD PTR [rsp], xmm0
        call    sink(unsigned long*)
        add     rsp, 24
        ret

Much better!

Recommended answer

Well, you answered your own question: the stack pointer needs to be aligned to 32 bytes before the local array can be accessed with aligned 32-byte AVX loads and stores (the vmovdqa on a ymm register above), but the ABI only provides 16 byte alignment. Since the compiler cannot know how much the alignment is off, it has to save the stack pointer in a scratch register and restore it afterwards. But the saved value has to outlive the function call, so it has to be put on the stack, and a stack frame has to be created.
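The same requirement can be provoked explicitly by asking for a 32-byte aligned local. The sketch below is a variation on the question's function, not the original code: `andpop_aligned`, `record_sink`, `last_sink_address` and `last_sink_first` are hypothetical names, and it assumes a compiler that supports 32-byte alignment for automatic variables (GCC and Clang on x86-64 do).

```c
#include <stdalign.h>
#include <stdint.h>

#define SZ 4

/* Hypothetical stand-in for sink(): remembers what it was handed. */
uintptr_t last_sink_address;
uint64_t  last_sink_first;

void record_sink(uint64_t *p)
{
    last_sink_address = (uintptr_t)p;
    last_sink_first   = p[0];
}

void andpop_aligned(const uint64_t *a)
{
    /* The ABI only promises 16-byte stack alignment, so the compiler has
       to realign the stack itself to honour this request. */
    alignas(32) uint64_t result[SZ];
    for (unsigned i = 0; i < SZ; i++) {
        result[i] = a[i] + 1;
    }
    record_sink(result);
}
```

Calling `andpop_aligned` with any four-element array should leave `last_sink_address` a multiple of 32, regardless of how the caller's stack happened to be aligned. Compiling this with optimization and inspecting the assembly is one way to see which realignment sequence a given compiler version picks.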

Some x86-64 ABIs have a red zone (a region of the stack below the stack pointer which is not used by signal handlers), so it is feasible not to change the stack pointer at all for such short functions, but GCC apparently does not implement this optimization and it would not apply here anyway because of the function call at the end.

In addition, the default stack alignment implementation is rather poor. For this case, -maccumulate-outgoing-args results in better-looking code with GCC 6, just aligning RSP after saving RBP, instead of copying the return address before saving RBP:

andpop:
        pushq   %rbp
        movq    %rsp, %rbp            # make a traditional stack frame
        andq    $-32, %rsp            # reserve 0 or 16 bytes
        subq    $32, %rsp

        vmovdqu (%rdi), %xmm0         # split unaligned load from tune=generic
        vinserti128     $0x1, 16(%rdi), %ymm0, %ymm0   # use -march=haswell instead
        movq    %rsp, %rdi
        vpaddq  .LC0(%rip), %ymm0, %ymm0
        vmovdqa %ymm0, (%rsp)

        vzeroupper
        call    sink@PLT
        leave
        ret
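For comparison, the simpler frame in this listing can be traced the same way. Again this is just a sketch of the arithmetic under the assumption that `rsp % 16 == 8` at entry; `simple_frame`, `frame_prologue` and `frame_leave` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

struct simple_frame {
    uint64_t rsp;      /* stack pointer at the `call sink@PLT` */
    uint64_t rbp;      /* frame pointer */
    uint64_t padding;  /* bytes discarded by the `andq` */
};

/* Models pushq/movq/andq/subq from the listing. */
struct simple_frame frame_prologue(uint64_t entry_rsp)
{
    assert(entry_rsp % 16 == 8);   /* assumed state right after a call */

    struct simple_frame f;
    f.rsp = entry_rsp - 8;                      /* pushq %rbp       */
    f.rbp = f.rsp;                              /* movq %rsp, %rbp  */
    uint64_t aligned = f.rsp & ~(uint64_t)31;   /* andq $-32, %rsp  */
    f.padding = f.rsp - aligned;
    f.rsp = aligned - 32;                       /* subq $32, %rsp   */
    return f;
}

/* Models `leave`; returns the rsp that `ret` will see. */
uint64_t frame_leave(struct simple_frame f)
{
    f.rsp = f.rbp;     /* leave, part 1: movq %rbp, %rsp */
    f.rsp += 8;        /* leave, part 2: popq %rbp       */
    return f.rsp;
}
```

Here `padding` is always 0 or 16, matching the comment in the listing, and `leave` recovers the entry `rsp` from `rbp` alone, so no extra register has to be saved and no copy of the return address is needed.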

(Editor's note: gcc8 and later make asm like this by default (Godbolt compiler explorer with gcc8, clang7, ICC19, and MSVC), even without -maccumulate-outgoing-args)

This issue (GCC generating poor code for stack alignment) recently came up when we had to implement a workaround for GCC __tls_get_addr ABI bug, and we ended up writing the stack realignment by hand.

EDIT There is also another issue, related to RTL pass ordering: stack alignment is picked before the final determination whether the stack is actually needed, as BeeOnRope's second example shows.
