Why does the Mac ABI require 16-byte stack alignment for x86-32?


Question

I can understand this requirement for the old PPC RISC systems and even for x86-64, but for the old tried-and-true x86? In this case, the stack only needs to be aligned on 4-byte boundaries. Yes, some of the SSE instructions require 16-byte alignment, but if that is a requirement of the callee, then the callee should ensure the alignment is correct. Why burden every caller with this extra requirement? This can actually cause some drop in performance, because every call site must manage this requirement. Am I missing something?
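To make the "let the callee ensure alignment" option concrete: a callee that needs aligned locals can round the stack pointer down itself, typically with an instruction such as and esp, -16, at the cost of keeping a frame pointer so the adjustment can be undone on return. The Python below is only a sketch of that address arithmetic; the function name and the sample addresses are hypothetical and not taken from any ABI document.

```python
def realign_down(esp: int, alignment: int = 16) -> int:
    """Model `and esp, -alignment`: round a stack address down.

    The x86 stack grows toward lower addresses, so rounding down
    never overlaps data that is already on the stack.
    """
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return esp & -alignment


# A callee entered with only a 4-byte guarantee may have to skip
# 0, 4, 8 or 12 bytes to reach a 16-byte boundary.
for esp in (0xBFFFF7A0, 0xBFFFF7A4, 0xBFFFF7A8, 0xBFFFF7AC):
    aligned = realign_down(esp)
    print(hex(esp), "->", hex(aligned), "skipped", esp - aligned)
```

The skipped amount depends on the incoming stack pointer, so it is only known at run time. That is why a self-aligning callee cannot address its locals at fixed offsets from the entry value of esp and needs the extra prologue work.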

Update: After some more investigation and some consultation with internal colleagues, I have a few theories about this:

  1. Consistency between the PPC, x86, and x64 versions of the OS.
  2. It seems that the GCC codegen now consistently does a sub esp,xxx and then "mov"s the data onto the stack rather than simply using "push" instructions. This could actually be faster on some hardware.
  3. While this does complicate the call sites a little, there is very little extra overhead when using the default "cdecl" convention, where the caller cleans up the stack.
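The sub esp,xxx pattern in item 2 and the low cdecl overhead in item 3 can be illustrated with a little arithmetic. The sketch below assumes the caller's esp is already 16-byte aligned before it reserves the outgoing argument area, and that arguments are then stored with mov at fixed offsets. The function names are hypothetical.

```python
def outgoing_area(arg_bytes: int, alignment: int = 16) -> int:
    """Bytes to reserve with `sub esp, N` so esp is still aligned at the call.

    Assumes esp is a multiple of `alignment` before the reservation.
    """
    if arg_bytes < 0:
        raise ValueError("arg_bytes must be non-negative")
    return (arg_bytes + alignment - 1) & -alignment


def padding(arg_bytes: int, alignment: int = 16) -> int:
    """Unused filler bytes inside the reserved area."""
    return outgoing_area(arg_bytes, alignment) - arg_bytes


# Three 4-byte arguments: reserve 16 bytes, 4 of which are filler.
print(outgoing_area(12), padding(12))
# Four 4-byte arguments fit exactly: reserve 16 bytes, no filler.
print(outgoing_area(16), padding(16))
```

Because the reserved size is a compile-time constant, a compiler can fold it into the single stack adjustment it already makes in the function prologue, which is consistent with the observation that cdecl call sites pay very little for the alignment rule.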

The issue I have with the last item is that, for calling conventions that rely on the callee cleaning the stack, the above requirement really "uglifies" the codegen. For instance, what if some compiler decided to implement a faster register-based calling style for its own internal use (i.e. for any code that isn't intended to be called from other languages or sources)? This stack-alignment requirement could negate some of the performance gains achieved by passing some parameters in registers.
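The awkwardness with callee-cleanup conventions can be seen by tracing esp through one call. This is a sketch under stated assumptions: esp is 16-byte aligned before the caller starts staging arguments, the callee removes only the argument bytes (as ret N does), and the function name is hypothetical.

```python
def esp_after_callee_cleanup_call(esp: int, arg_bytes: int, alignment: int = 16) -> int:
    """Trace esp across a call whose callee pops its own arguments."""
    if esp % alignment:
        raise ValueError("sketch assumes esp is aligned before staging arguments")
    pad = -arg_bytes % alignment  # filler so esp is aligned at the call
    esp -= pad                    # caller: sub esp, pad
    esp -= arg_bytes              # caller: push the arguments
    esp -= 4                      # call: pushes the 4-byte return address
    esp += 4 + arg_bytes          # callee: ret arg_bytes
    return esp                    # the filler is still on the stack


start = 0xBFFFF7A0
end = esp_after_callee_cleanup_call(start, 12)
print("filler left for the caller to remove:", start - end)
```

With 12 bytes of arguments the callee pops 12 bytes, but the 4 bytes of filler remain, so the caller still needs its own stack adjustment. The convention's main benefit, a call site with no cleanup code, is lost whenever the argument size is not a multiple of 16.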

Update: So far the only real answer has been consistency, but to me that's a bit too easy of an answer. I have well over 20 years of experience with the x86 architecture, and if consistency, rather than performance or something else concrete, is really the reason, then I respectfully suggest that it is a bit naive for the developers to require it. They're ignoring nearly three decades of tools and support, especially if they're expecting tools vendors to quickly and easily adapt their tools for their platform (maybe not... it is Apple...) without having to jump through several seemingly unnecessary hoops.

I'll give this topic another day or so then close it...

Answer

From "Intel® 64 and IA-32 Architectures Optimization Reference Manual", section 4.4.2:

"For best performance, the Streaming SIMD Extensions and Streaming SIMD Extensions 2 require their memory operands to be aligned to 16-byte boundaries. Unaligned data can cause significant performance penalties compared to aligned data."

From Appendix D:

"It is important to ensure that the stack frame is aligned to a 16-byte boundary upon function entry to keep local __m128 data, parameters, and XMM register spill locations aligned throughout a function invocation."
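As a rough illustration of why alignment at function entry matters for locals: if esp is 16-byte aligned at the call instruction (the requirement discussed in the question), then the return address and a conventional push ebp prologue have consumed 8 bytes by the time locals are allocated. The sketch below assumes exactly that prologue; the constant and function names are hypothetical.

```python
RETURN_ADDRESS_BYTES = 4  # pushed by the call instruction
SAVED_EBP_BYTES = 4       # pushed by an assumed `push ebp` prologue


def frame_size_for(local_bytes: int, alignment: int = 16) -> int:
    """Smallest N for `sub esp, N` that holds local_bytes and leaves esp aligned.

    Assumes esp was a multiple of `alignment` at the call instruction.
    """
    if local_bytes < 0:
        raise ValueError("local_bytes must be non-negative")
    used = RETURN_ADDRESS_BYTES + SAVED_EBP_BYTES
    return local_bytes + (-(used + local_bytes) % alignment)


# One 16-byte __m128 local: sub esp, 24 makes esp, and the local at [esp], aligned.
print(frame_size_for(16))
# No locals at all: sub esp, 8 restores alignment before any nested call.
print(frame_size_for(0))
```

Under these assumptions the frame size is a constant known at compile time, so aligned locals and spill slots sit at fixed offsets and no run-time realignment of esp is needed.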

http://www.intel.com/Assets/PDF/manual/248966.pdf
