如何在 x86 中仅使用 2 个连续的 leal 指令将寄存器乘以 37? [英] How to multiply a register by 37 using only 2 consecutive leal instructions in x86?

查看:23
本文介绍了如何在 x86 中仅使用 2 个连续的 leal 指令将寄存器乘以 37?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

假设 %edi 包含 x 并且我想仅使用 2 个连续的 leal 指令得到 37*x,我将如何处理?

Say %edi contains x and I want to end up with 37*x using only 2 consecutive leal instructions, how would I go about this?

例如要获得 45 倍,您可以这样做

For example to get 45x you would do

leal (%edi, %edi, 8), %edi   
leal (%edi, %edi, 4), %eax (to be returned)

我一生都无法弄清楚用什么数字代替 8 和 4,这样结果 (%eax) 将是 37x

I cannot for the life of me figure out what numbers to put in place of the 8 and 4 so that the result (%eax) will be 37x

推荐答案

-O3 处,gcc 将发出(Godbolt 编译器资源管理器):

At -O3, gcc will emit (Godbolt compiler explorer):

int mul37(int a)  { return a*37; }

    leal    (%rdi,%rdi,8), %eax      # eax = a * 9
    leal    (%rdi,%rax,4), %eax      # eax = a + 4*(a*9)
    ret

那是使用 37 = 9*4 + 1不会用第一个 lea 破坏原始的 a 值,所以它可以两个都用在第二个.

That's using 37 = 9*4 + 1, not destroying the original a value with the first lea so it can use both in the 2nd.

不过,您没有发现这个是很好的伙伴:最近的 clang(3.8 和更新版本)通常会使用 2 个 lea 指令而不是 imul(例如对于 *15),但它错过了这个并使用:

You're in good company in not spotting this one, though: recent clang (3.8 and newer) will normally use 2 lea instructions instead of an imul (e.g. for *15), but it misses this one and uses:

    imull   $37, %edi, %eax
    ret

它确实使用与 gcc 相同的模式来处理 *21,如 5*4 + 1.(clang3.6 及更早版本总是使用 imul,除非有一个单指令替代 shllea)

It does do *21 with the same pattern as gcc uses, as 5*4 + 1. (clang3.6 and earlier always used imul unless there was a single-instruction alternative shl or lea)

ICC 和 MSVC 也使用 imul,但他们似乎不喜欢使用 2 个 lea 指令,所以 imul 是故意"在那里的.

ICC and MSVC also use imul, but they don't seem to like using 2 lea instructions, so the imul is "on purpose" there.

有关 gcc7.2 与 clang5.0 的各种乘法器的信息,请参阅 Godbolt 链接.尝试 gcc -m32 -mtune=pentium 甚至 pentium3 看看当时 gcc 愿意使用多少指令是很有趣的.虽然 P2/P3 对于 imul r, r, i 有 4 个周期的延迟,所以这有点疯狂.Pentium 有 9 个周期 imul 并且没有 OOO 来隐藏延迟,所以努力避免它是有意义的.

See the godbolt link for a variety of multipliers with gcc7.2 vs. clang5.0. It's interesting to try gcc -m32 -mtune=pentium or even pentium3 to see how many more instructions gcc was wiling to use back then. Although P2/P3 has 4-cycle latency for imul r, r, i, so that's kinda crazy. Pentium has 9 cycle imul and no OOO to hide the latency, so it makes sense to try hard to avoid it.

mtune=silvermont 应该只愿意用一条指令替换 32 位 imul,因为它有 3 周期延迟/1c 吞吐量乘法,但解码通常是瓶颈(根据 Agner Fog,http://agner.org/optimize/).您甚至可以考虑使用 imul $64, %edi, %eax(或其他 2 的幂)而不是 mov/shl,因为 imul-immediate是一个复制和乘法.

mtune=silvermont should probably only be willing to replace 32-bit imul with a single instruction, because it has 3-cycle latency / 1c throughput multiply, but decode is often the bottleneck (according to Agner Fog, http://agner.org/optimize/). You could even consider imul $64, %edi, %eax (or other powers of 2) instead of mov/shl, because imul-immediate is a copy-and-multiply.

具有讽刺意味的是,gcc 错过了 * 45 的情况,而使用了 imul,而 clang 使用了 2 个 lea.猜猜是时候提交一些错过的优化错误报告了.如果 2 个 LEA 优于 1 个 IMUL,则应尽可能使用它们.

Ironically, gcc misses the * 45 case, and uses imul, while clang uses 2 leas. Guess it's time to file some missed-optimization bug reports. If 2 LEAs are better than 1 IMUL, they should be used wherever possible.

较旧的 clang(3.7 及更早版本)使用 imul,除非单个 lea 可以解决问题.我还没有查看变更日志,看看他们是否做了基准测试来决定支持延迟而不是吞吐量.

Older clang (3.7 and older) uses imul unless a single lea will do the trick. I haven't looked up the changelog to see if they did benchmarks to decide to favour latency over throughput.

相关:对不存在的值使用 LEAt 地址/指针?关于为什么 LEA 使用内存操作数语法和机器编码的规范答案,即使它是一个 shift+add 指令(并且在大多数现代微体系结构中运行在 ALU 上,而不是 AGU 上.)

Related: Using LEA on values that aren't addresses / pointers? canonical answer about why LEA uses memory-operand syntax and machine encoding, even though it's a shift+add instruction (and runs on an ALU, not AGU, in most modern microarchitectures.)

这篇关于如何在 x86 中仅使用 2 个连续的 leal 指令将寄存器乘以 37?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆