x86 assembly: Pop a value without storing it

Question

In x86 assembly, is it possible to remove a value from the stack without storing it? Something along the lines of pop word null? I could obviously use add esp,4, but maybe there's a nice and clean CISC mnemonic I'm missing?

Recommended answer

add esp,4 / add rsp,8 is the normal / idiomatic / clean way. No special way is needed because the stack isn't magical or special (at least not in this respect); it's just a pointer in a register, plus some instructions that use that register implicitly. (And for kernel stacks, interrupts use the stack asynchronously, so software couldn't implement a kernel red-zone even if it wanted to...)
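
As a minimal sketch of the idiom, assuming NASM syntax and 32-bit mode (the label discard_one is only an illustrative name, not something from the question):

```asm
bits 32
section .text
global discard_one

discard_one:
    push dword 42       ; put a value on the stack that we will not need again
    add  esp, 4         ; discard it: move the stack pointer up by one 4-byte slot
    ret                 ; esp is back at the return address
```

In 64-bit mode the same pattern would use add rsp, 8, since each stack slot is 8 bytes wide there.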

Other than that, the magical CISC way to clean up a whole stack frame at the end of a function is leave = mov esp, ebp / pop ebp (or the 16 or 64-bit equivalent). Unlike enter, it's fast enough on modern CPUs to be usable in practice, but it is still a 3-uop instruction on Intel CPUs (see http://agner.org/optimize/). But leave only works if you spent extra instructions making a stack frame with ebp / rbp in the first place. (Usually you wouldn't do that, unless you need to reserve a variable amount of stack space, e.g. with push in a loop to make an array, or the equivalent of a C99 VLA or alloca. Or for beginner code to make access to locals easier, or in 16-bit mode where SP can't be used in addressing modes.)
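
A small sketch of a function that sets up a frame pointer and tears the frame down with leave, assuming NASM syntax and 32-bit mode (the label with_frame and the 16-byte local area are made up for illustration):

```asm
bits 32
section .text
global with_frame

with_frame:
    push ebp                ; save the caller's frame pointer
    mov  ebp, esp           ; establish our own frame pointer
    sub  esp, 16            ; reserve 16 bytes of locals

    mov  dword [ebp-4], 1   ; store to a local through ebp
    mov  eax, [ebp-4]       ; load it back as the return value

    leave                   ; same effect as: mov esp, ebp / pop ebp
    ret
```

However much the body moved esp, leave restores it from ebp, which is why it only helps in functions that kept ebp as a frame pointer.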

The magical CISC way to clean up stack-args is for the callee to use ret imm16 (costing 1 extra uop) to pop the args, creating a calling convention where the callee cleans the stack. In a caller-pops calling convention, there's no way to use this form of ret, but you can simply leave the stack pointer where it is (keeping the arg space reserved) and use mov instead of push to store the args for the next function call (if the function needs any stack-args at all; register-arg calling conventions are generally more efficient).
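
Both approaches can be sketched as follows, assuming NASM syntax and 32-bit mode. All labels (sum_callee_pops, sum_caller_pops, and the two callers) are hypothetical names invented for this example:

```asm
bits 32
section .text
global sum_callee_pops, sum_caller_pops, caller_a, caller_b

; Callee-pops convention: the callee removes its own 8 bytes of args.
sum_callee_pops:
    mov  eax, [esp+4]       ; first arg
    add  eax, [esp+8]       ; second arg
    ret  8                  ; pop the return address, then add 8 to esp

caller_a:
    push dword 2            ; second arg
    push dword 1            ; first arg
    call sum_callee_pops    ; on return, esp is already where it was before the pushes
    ret

; Caller-pops convention: a plain ret, the args stay on the stack.
sum_caller_pops:
    mov  eax, [esp+4]
    add  eax, [esp+8]
    ret

caller_b:
    sub  esp, 8             ; reserve the arg space once
    mov  dword [esp], 1     ; store args with mov instead of push
    mov  dword [esp+4], 2
    call sum_caller_pops
    mov  dword [esp], 3     ; reuse the same slots for the next call
    mov  dword [esp+4], 4
    call sum_caller_pops
    add  esp, 8             ; a single cleanup at the end
    ret
```

In caller_b both slots are rewritten before the second call because a callee is normally free to overwrite its stack args.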

So the magic CISC ways have no performance advantage on modern CPUs, only a minor code-size advantage.

There are 2 reasons you might use pop reg instead of add esp,4:

  • code-size: pop r32/r64 is a one-byte instruction, vs. 3 bytes for add esp,4 or 4 bytes for add rsp,8.
  • performance: Intel's stack engine has to insert extra stack-sync uops when you use esp / rsp explicitly after a stack instruction (push/pop/call/ret). So after a call (which returns with a ret), it saves a uop to use pop instead of add esp,4 before you ret at the end of the function.
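
Both points can be seen in a sketch like the one below, assuming NASM syntax, 64-bit mode, and a calling convention in which rcx is call-clobbered. helper is a hypothetical external function and wrapper is an illustrative name. The byte values in the comments are the standard encodings of these instructions:

```asm
bits 64
section .text
extern helper               ; hypothetical function defined elsewhere
global wrapper

wrapper:
    push rax                ; 50: one byte, only here to adjust rsp by 8 (value irrelevant)
    call helper
    pop  rcx                ; 59: one byte, discards the slot into a dead register
    ret                     ; rax still holds helper's return value

; The explicit form of the same cleanup would be:
;   add rsp, 8              ; 48 83 C4 08: four bytes, and reads rsp explicitly
```

The pop sits between the ret inside helper and the ret here, so rsp is never referenced explicitly between stack instructions.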

AMD's stack engine doesn't need extra stack-sync uops, but still makes push/pop single-uop instructions. That's unlike older Intel/AMD CPUs, where push/pop cost more than plain mov loads/stores because they needed a separate uop for the stack-pointer modification and created a data dependency on the stack pointer.

See "Why does this function push RAX to the stack as the first operation?" for more details about performance.

If you were looking for aesthetics, well, you can indent, format, and comment your code nicely, but beyond that, you chose the wrong language when you picked x86 asm if aesthetics outweigh optimization.

Of course, if you need to adjust the stack by more than 1 register-width, definitely use add if you don't need the data that pop would load. Or, if you need to adjust it by +128 bytes, use sub esp, -128, because -128 is encodable as a sign-extended imm8, but +128 isn't.
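
The imm8 boundary shows up directly in the encodings. This is a listing for comparing sizes rather than code to run in sequence, assuming NASM syntax, 32-bit mode, and an assembler that picks the shortest immediate form:

```asm
bits 32
section .text

    add  esp, 127           ; 83 C4 7F           3 bytes: +127 fits in a signed imm8
    add  esp, 128           ; 81 C4 80 00 00 00  6 bytes: +128 needs a full imm32
    sub  esp, -128          ; 83 EC 80           3 bytes: -128 fits, same effect on esp
```

A sign-extended imm8 covers -128 to +127, so +128 is the one adjustment where flipping add to sub saves 3 bytes.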

Or maybe use lea esp, [esp+4], like gcc does with -mtune=atom. (For in-order Atom, not Silvermont.) Like I said, if you wanted clean, you shouldn't have picked x86 asm.
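
For comparison, here are the two forms side by side as an encoding listing, assuming NASM syntax and 32-bit mode:

```asm
bits 32
section .text

    add  esp, 4             ; 83 C4 04     3 bytes, writes EFLAGS
    lea  esp, [esp+4]       ; 8D 64 24 04  4 bytes, leaves EFLAGS untouched
```

The lea form is a byte longer because addressing through esp requires a SIB byte.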

You can almost always find a dead register to pop into. If you need to adjust E/RSP by one stack slot before popping some registers you actually wanted to pop, you can always pop the same register twice.
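
A sketch of the pop-twice trick, assuming NASM syntax and 32-bit mode. The label twice and the choice of ebx/esi as the saved registers are purely illustrative:

```asm
bits 32
section .text
global twice

twice:
    push esi                ; save call-preserved registers
    push ebx
    push eax                ; one extra scratch slot on top of the saved registers

    ; ... body that uses ebx, esi, and the scratch slot ...

    pop  ebx                ; discard the scratch slot (ebx briefly holds junk)
    pop  ebx                ; overwrite the junk with the saved ebx
    pop  esi                ; restore esi
    ret
```

The first pop ebx costs nothing in register pressure, because ebx is about to be overwritten by the second pop anyway.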

In the extremely rare case where none of the 7 (x86-32) or 15 (x86-64) non-stack registers is available as a pop destination, this optimization is not available and you should simply use the traditional add. It's not worth spending extra instructions to make it possible to pop; that would outweigh the minor benefit of using pop.

Note that pop Sreg (segment register) still consumes the regular "stack width" (32 or 64 bits, depending on mode), rather than only 16 for a 16-bit register. But only pop ds/es/ss are single-byte. pop fs/gs are 2 bytes each. So if you're optimizing for code-size, pop gs is 1 byte smaller than add esp,4, but much much slower. (Or 2 bytes smaller than add rsp,8).
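
The sizes being compared can be laid out as an encoding listing, assuming NASM syntax and 32-bit mode. This is for reading, not for running in sequence: each pop here removes one 4-byte slot, and a segment-register pop loads the popped value as a selector, so an arbitrary stack value can fault.

```asm
bits 32
section .text

    pop  eax                ; 58        1 byte
    pop  es                 ; 07        1 byte
    pop  ss                 ; 17        1 byte
    pop  ds                 ; 1F        1 byte
    pop  fs                 ; 0F A1     2 bytes
    pop  gs                 ; 0F A9     2 bytes
    add  esp, 4             ; 83 C4 04  3 bytes
```

In 64-bit mode only the fs and gs forms are valid, which is why the comparison against add rsp,8 uses pop gs.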
