How to copy a register and do `x*4 + constant` with the minimum number of instructions

Question

I am new to x86 assembly. Take the following operation as an example: multiply the contents of ESP by 4 and add 0x11233344, storing the result in EDI.

How do I express this in x86 assembly with the minimum number of instructions?

push esp
mov edi, 4
mul edi
add edi, 0x11233344

Recommended answer

Your asm doesn't make any sense (push esp copies to memory, not another register), and mul edi writes EDX:EAX not edi. It does EDX:EAX = EAX * src_operand. Read the manual: https://www.felixcloutier.com/x86/MUL.html. Or better, use imul instead unless you actually need the high-half output of the full 32x32 => 64-bit multiply.
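To make the difference concrete, here is a minimal C sketch of what the two multiplies compute. It is only a model of the arithmetic, not of flags or timing, and the names `mul_result`, `mul32`, `imul32`, and `mul_demo` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one-operand `mul src`:
   EDX:EAX = EAX * src (unsigned 32x32 => 64-bit product). */
typedef struct { uint32_t eax, edx; } mul_result;

static mul_result mul32(uint32_t eax, uint32_t src) {
    uint64_t product = (uint64_t)eax * src;
    mul_result r = { (uint32_t)product, (uint32_t)(product >> 32) };
    return r;
}

/* Hypothetical model of `imul dst, src, imm`:
   dst = low 32 bits of src * imm, written to a register you choose. */
static uint32_t imul32(uint32_t src, int32_t imm) {
    return src * (uint32_t)imm;   /* unsigned arithmetic wraps mod 2^32 */
}

/* Small self-check: the high half only exists in the `mul` result. */
static void mul_demo(void) {
    mul_result r = mul32(0x80000000u, 4u);
    assert(r.eax == 0u && r.edx == 2u);     /* high half lands in EDX */
    assert(imul32(0x80000000u, 4) == 0u);   /* high half is discarded */
}
```

Note that in the question's code the source operand of `mul edi` is EDI itself, so EDI is never written at all.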

Also, don't use the stack pointer register ESP to hold temporary values unless you know exactly what you're doing (e.g. you're in user-space, and you've made sure no signal handlers can asynchronously use the stack). Computing stack-pointer * 4 + large-constant is not something that a normal program would ever do.

Normally you could do this in one LEA instruction, but ESP is the only register that can't be an index in an x86 addressing mode. See "rbp not allowed as SIB base?". (The index is the part of an addressing mode that can have a 2-bit shift count applied, aka a scale factor.)

I think our best bet is still just to copy ESP to EDI, then use LEA:

 mov  edi, esp
 lea  edi, [edi * 4 + 0x11223344]
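As a rough illustration of how a scaled-index LEA like the one above computes its result, the following C sketch treats the written scale factor as a 2-bit shift count. The helper names `scale_to_shift_count` and `lea_index_disp` are hypothetical, and the sketch ignores the base register and the actual byte encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: map the scale factor written in asm source
   (1, 2, 4 or 8) to the 2-bit shift count stored in the machine code.
   Returns -1 for factors that cannot be encoded. */
static int scale_to_shift_count(unsigned factor) {
    switch (factor) {
    case 1: return 0;
    case 2: return 1;
    case 4: return 2;
    case 8: return 3;
    default: return -1;
    }
}

/* Hypothetical model of `lea dst, [index*scale + disp32]`:
   the index is shifted left by the 2-bit count, then disp is added. */
static uint32_t lea_index_disp(uint32_t index, unsigned shift_count, uint32_t disp) {
    return (index << (shift_count & 3u)) + disp;
}

/* Small self-check using the constant from the answer. */
static void lea_demo(void) {
    int count = scale_to_shift_count(4);
    assert(count == 2);
    assert(lea_index_disp(5u, (unsigned)count, 0x11223344u) == 5u * 4u + 0x11223344u);
}
```

A factor such as 3 has no encoding, which is why only shifts by 0 to 3 bits are available in an addressing mode.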

Or you could copy-and-add with LEA, and then left shift, because the value we're adding has two zeros as its low bits (i.e. it's a multiple of 4). So we can right shift it by 2 without losing any bits.

SHIFTED_ADD_CONSTANT equ 0x11223344 >> 2

  lea    edi, [esp + SHIFTED_ADD_CONSTANT]
  shl    edi, 2

The add before the left shift can carry into the top 2 bits, but we're about to shift those bits out, so it doesn't matter what's there.
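The equivalence of the two sequences can be checked with a small C sketch that models 32-bit wrapping arithmetic with `uint32_t`. The function names are made up for illustration, and the sample values are arbitrary; the sketch assumes the constant 0x11223344 used in the answer's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADD_CONSTANT          0x11223344u
#define SHIFTED_ADD_CONSTANT  (ADD_CONSTANT >> 2)  /* exact: low 2 bits are zero */

/* Models: mov edi, esp ; lea edi, [edi*4 + 0x11223344] */
static uint32_t mov_then_lea(uint32_t esp) {
    return esp * 4u + ADD_CONSTANT;
}

/* Models: lea edi, [esp + SHIFTED_ADD_CONSTANT] ; shl edi, 2 */
static uint32_t lea_then_shl(uint32_t esp) {
    return (esp + SHIFTED_ADD_CONSTANT) << 2;
}

/* Compare both versions on a few inputs, including ones where the
   add carries into the top 2 bits that the shift then discards. */
static void check_equivalence(void) {
    const uint32_t samples[] = {
        0u, 1u, 0x0012FF7Cu, 0x3FFFFFFFu, 0x40000000u, 0xC0000000u, 0xFFFFFFFFu
    };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        assert(mov_then_lea(samples[i]) == lea_then_shl(samples[i]));
    }
}
```

The trick only works because the constant is a multiple of 4; with low bits set, the right shift would lose them and the two results would differ.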

This is also 2 uops, and it's more efficient on AMD Bulldozer-family CPUs, which have no mov-elimination for GP-integer mov and where a scaled index costs an extra cycle of latency for LEA. Zen has mov-elimination, but I think the LEA latencies are still the same, so both versions have 2-cycle latency there. Even a "complex" LEA has 2/clock throughput on Zen, versus 4/clock for a simple LEA (any ALU port).

But it's less efficient on Intel IvyBridge and later CPUs, where the mov can run with zero latency (mov-elimination) and the [edi*4 + disp32] addressing mode is still a fast 2-component LEA. So on Intel CPUs with mov-elimination, the first version is 2 front-end uops, 1 unfused-domain uop for an execution unit, and only 1 cycle of latency.

Another 2-instruction option is to use a slower imul instead of a fast shift. (Addressing modes use a shift: even though it's written as * 1 / 2 / 4 / 8, it's encoded in a 2-bit shift-count field in machine code).

  imul  edi, esp, 4       ; this is dumb, don't use mul/imul for powers of 2.
  add   edi, 0x11223344

imul has 3-cycle latency on modern x86 CPUs, which is pretty good, but it's slower on old CPUs like Pentium 3. That's still not as good as the 1 or 2-cycle latency of mov + LEA, and imul runs on fewer ports.

(The number of instructions is not usually the thing to optimize for; the number of uops usually matters more, along with latency and back-end throughput. Code size in bytes of x86 machine code also matters; different instructions have different lengths.)
