Efficient Assembly multiplication


Problem description

I started practicing assembly a short while ago. I want to implement efficient multiplication using the assembly instructions lea and shift. I want to write a C program that calls the assembly routine appropriate for a constant argument received from the user, and multiplies another argument received from the user by that constant.

How can I make this code efficient?
Which numbers (if any) can I group together so they fit the same routine? For example, I think I can group 2, 4, 8, ... into the same routine, since they are just left shifts by 1, 2, 3.

But I'm having trouble finding other groups like that, or handling the remaining numbers, and what about negative numbers...

Solution


The interesting part of this exercise is finding ways to use 1 or 2 LEA, SHL, and/or ADD/SUB instructions to implement multiplies by various constants.

Actually dispatching on the fly for a single multiply isn't very interesting, and would mean either actual JIT compiling or that you have every possible sequence already present in a giant table of tiny blocks of code. (Like switch statements.)

Instead I'd suggest writing a C or Python or whatever function that takes 1 integer arg, and as output produces the asm source text that implements x * n where n is the integer arg. i.e. a function like you might find in a compiler that optimizes a multiply-by-constant.

You might want to cook up an automated way to test this, e.g. by comparing against a pure C x * n for a couple different x values.
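
For example, here is a minimal sketch of such a generator in C (emit_mul is a made-up name; I'm assuming the x86-64 System V convention used in the examples below, x in EDI and the result in EAX). It only handles the trivial patterns and falls back to imul-immediate for everything else:

#include <stdio.h>
#include <stdlib.h>

/* emit_mul: print an asm body that computes x*n, x in EDI, result in EAX.
 * Only the easy patterns from this answer are handled; everything else
 * falls back to imul-immediate. */
static void emit_mul(int n)
{
    unsigned m = (unsigned)n;

    if (n == 0) {
        puts("        xor     eax, eax");              /* x*0 = 0 */
    } else if ((m & (m - 1)) == 0) {                   /* power of 2: shift */
        int k = __builtin_ctz(m);
        puts("        mov     eax, edi");
        if (k)
            printf("        sal     eax, %d\n", k);
    } else if (n == 3 || n == 5 || n == 9) {           /* 2^k + 1: one LEA */
        printf("        lea     eax, [rdi + %d*rdi]\n", n - 1);
    } else {                                           /* fallback */
        printf("        imul    eax, edi, %d\n", n);
    }
}

int main(int argc, char **argv)
{
    int n = argc > 1 ? atoi(argv[1]) : 15;
    printf("mul%d:\n", n);
    emit_mul(n);
    puts("        ret");
}

You could then compare its output for a range of n against gcc -O3, or assemble it and check the result against plain C x * n for a few x values.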


If you can't get the job done in 2 instructions (or 3 with one of them being mov), it's not worth it. Modern x86 has ridiculously efficient multiply in hardware. imul reg, r/m, imm is 1 uop, 3 cycle latency, fully pipelined. (AMD since Zen, Intel since Core2 or Nehalem or so.) That's your fallback for anything that you can't get done with a critical path length of 1 or 2 cycles (assuming zero-latency mov if you want, like IvyBridge+ and Zen.)

Or you could set a higher threshold before fallback if you want to explore more complicated sequences, e.g. aim for 64-bit multiply on Bulldozer-family (6 cycle latency). https://agner.org/optimize/. Or even P5 Pentium where imul takes 9 cycles (not pairable).


Patterns to look for

Integer multiply boils down to adding up shifted copies of 1 operand where the other operand has 1 bits. (See the algorithm for implementing multiply by runtime-variable values, by shift and add checking each bit one at a time.)

The easiest pattern is of course only a single set bit, i.e. a power of 2; then it's just a left shift. This is easy to check for: n & (n-1) == 0, when n != 0.

Anything with exactly 2 set bits is at most 2 shifts and an add. (GNU C __builtin_popcount(n) counts set bits. In x86 asm, SSE4.2 popcnt).

GNU C __builtin_ctz finds the bit-index of the lowest set bit. Using it on a number you know is non-zero will give you the shift count for the low bit. In x86 asm, bsf / tzcnt.

To clear that lowest set bit and "expose" the next-lowest, you can do n &= n-1;. In x86 asm, BMI1 blsr or LEA / AND.
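
Put together, those checks might look like this in GNU C (just a sketch; the value 0x28 is an arbitrary example with two set bits):

#include <stdio.h>

int main(void)
{
    unsigned n = 0x28;                   /* binary 101000: bits 3 and 5 set */

    printf("power of 2?       %d\n", n && (n & (n - 1)) == 0);   /* 0 */
    printf("set bits:         %d\n", __builtin_popcount(n));     /* 2 */
    printf("lowest bit index: %d\n", __builtin_ctz(n));          /* 3 */

    n &= n - 1;                          /* clear the lowest set bit (blsr) */
    printf("after clearing:   0x%x, next bit index %d\n",
           n, __builtin_ctz(n));         /* 0x20, 5 */
}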


Another interesting pattern to look for is 2^n +- 1. The +1 case is already covered by the 2-set-bits case, but the shift count for the low bit is 0; no shift needed. With shift counts up to 3, you can do it in one LEA.

You can detect 2^n - 1 by checking if n+1 is a power of 2 (has only 1 bit set). Somewhat more complex, (2^n - 1) * 2^m can be done with this trick plus another shift. So you could try right-shifting to bring the lowest set bit to the bottom then looking for tricks.
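
A sketch of that detection in GNU C (the helper names are mine): c has the form 2^n - 1 exactly when (c & (c + 1)) == 0, i.e. c+1 is a power of 2, and the (2^n - 1) * 2^m form falls out after shifting the lowest set bit down with ctz:

#include <stdio.h>

static int is_pow2_minus_1(unsigned c)          /* 3, 7, 15, 31, ... */
{
    return c != 0 && (c & (c + 1)) == 0;        /* c+1 is a power of 2 */
}

/* Is c of the form (2^n - 1) << m ?  If so, report n and m. */
static int is_shifted_pow2_minus_1(unsigned c, int *n, int *m)
{
    if (c == 0) return 0;
    *m = __builtin_ctz(c);                      /* shift the low set bit down */
    unsigned odd = c >> *m;
    if (!is_pow2_minus_1(odd)) return 0;
    *n = __builtin_popcount(odd);               /* length of the run of 1 bits */
    return 1;
}

int main(void)
{
    unsigned tests[] = {15, 30, 60, 17};
    for (int i = 0; i < 4; i++) {
        int n, m;
        if (is_shifted_pow2_minus_1(tests[i], &n, &m))
            printf("%u = (2^%d - 1) << %d\n", tests[i], n, m);
        else
            printf("%u: not of that form\n", tests[i]);
    }
}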

GCC does this the 2^n - 1 way:

mul15:              # gcc -O3 -mtune=bdver2
        mov     eax, edi
        sal     eax, 4          # x*16
        sub     eax, edi        # x*16 - x = x*15
        ret

clang is more efficient (for Intel CPUs where scaled-index is still only 1 cycle latency):

mul15:             # clang -O3 -mtune=bdver2
        lea     eax, [rdi + 4*rdi]      # x*5
        lea     eax, [rax + 2*rax]      # (x*5)*3 = x*15
        ret


Combining these patterns

Maybe factorize your number into its prime factors and look for ways to use your building blocks to do combinations of those factors.

But this isn't the only approach. You can do x*11 as x*5*2 + x, like GCC and Clang do (which is a lot like How to multiply a register by 37 using only 2 consecutive leal instructions in x86?):

        lea     eax, [rdi + 4*rdi]      # x*5
        lea     eax, [rdi + 2*rax]      # x + (x*5)*2 = x*11

There are 2 approaches for x*17 as well. GCC and Clang do it this way:

mul17:
        mov     eax, edi
        sal     eax, 4
        add     eax, edi
        ret

But another way which they fail to use even with -march=sandybridge (no mov-elimination, 1-cycle LEA [reg + reg*scale]) is:

mul17:
        lea    eax,  [rdi + 8*rdi]  ; x*9
        lea    eax,  [rax + 8*rdi]  ; x*9 + x*8 = x*17

So instead of multiplying factors, we're adding different multipliers to make the total multiplier.
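
If you want to search for such combinations programmatically, a brute-force feasibility check is one option. This is only a sketch under my own simplifications: mov is treated as free, and the second instruction is limited to the forms shown above (an LEA with scale 1/2/4/8, a shift, or one add/sub). After the first instruction you hold x and y = m1*x; the second one combines them:

#include <stdio.h>
#include <stdint.h>

/* Multipliers reachable with one LEA or shift:
 * lea [r + s*r] gives 2, 3, 5, 9; a shift gives any power of 2. */
static int one_insn(uint64_t m)
{
    if (m == 3 || m == 5 || m == 9) return 1;
    return m != 0 && (m & (m - 1)) == 0;
}

/* Can x*n be done in at most two LEA/SHL/ADD/SUB instructions
 * (mov treated as free)?  After the first instruction we hold
 * x and y = m1*x; the second one combines them. */
static int two_insn(uint64_t n)
{
    static const uint64_t scale[] = {1, 2, 4, 8};
    if (n == 1 || one_insn(n)) return 1;
    for (uint64_t m1 = 2; m1 <= n + 1; m1++) {
        if (!one_insn(m1)) continue;
        for (int i = 0; i < 4; i++) {
            uint64_t s = scale[i];
            if (m1 * (1 + s) == n) return 1;    /* lea [y + s*y]           */
            if (m1 + s == n)       return 1;    /* lea [y + s*x], e.g. 17  */
            if (1 + s * m1 == n)   return 1;    /* lea [x + s*y], e.g. 11  */
        }
        if (m1 - 1 == n) return 1;              /* sub: y - x, e.g. 15, 31 */
        for (int k = 1; k < 32; k++)
            if ((m1 << k) == n) return 1;       /* shift y afterwards      */
    }
    return 0;
}

int main(void)
{
    for (uint64_t n = 1; n <= 40; n++)
        printf("x*%-2llu: %s\n", (unsigned long long)n,
               two_insn(n) ? "2 instructions or fewer" : "fall back to imul");
}

Recording which form matched, instead of just returning 1, would give you the actual instruction sequence to emit.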


I don't have any great suggestions for how to programmatically search for these sequences beyond the simple ones like 2 set bits, or 2^n +- 1. If you're curious, have a look in GCC or LLVM source code for the functions that do these optimizations; they find a lot of tricky ones.

The work might be split between target-neutral optimization passes for powers of 2 vs. x86-specific target code for using LEA, and for deciding on a threshold of how many instructions is worth it before falling back to imul-immediate.


Negative numbers

x * -8 could be done with x - x*9. I think that might be safe even if x*9 overflows but you'd have to double-check on that.
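
One way to double-check that in C (a sketch; doing the arithmetic in uint32_t avoids signed-overflow UB and matches what the 32-bit registers compute): x - x*9 equals x * -8 modulo 2^32 for every x, so wraparound in the x*9 intermediate doesn't change the final 32-bit result.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Probe values, including ones where x*9 wraps around 2^32. */
    uint32_t probes[] = {0, 1, 7, 0x12345678u, 0x7FFFFFFFu, 0x80000000u, 0xFFFFFFFFu};

    for (size_t i = 0; i < sizeof probes / sizeof probes[0]; i++) {
        uint32_t x = probes[i];
        uint32_t lea_sub = x - x * 9u;           /* the x - x*9 version  */
        uint32_t ref     = x * (uint32_t)-8;     /* plain multiply by -8 */
        printf("x=%08x: %08x %08x %s\n", x, lea_sub, ref,
               lea_sub == ref ? "ok" : "MISMATCH");
    }
}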


Look at compiler output

#define MULFUN(c) int mul##c(int x) { return x*c; }
MULFUN(9)
MULFUN(10)
MULFUN(11)
MULFUN(12)
...

I put that on the Godbolt compiler explorer for the x86-64 System V ABI (first arg in RDI, like the above examples). With gcc and clang -O3. I used -mtune=bdver2 (Piledriver) because it has somewhat slower multiply than Intel or Zen. This encourages GCC and Clang to avoid imul slightly more aggressively.

I didn't check whether long / uint64_t would change that (6 cycle instead of 4 cycle latency, and half the throughput.) Or if an older uarch like -mtune=nocona (Pentium 4) would make a difference. -mtune=bdver2 did make a difference vs. the default tune=generic for GCC at least.

If you use -m32, you can use even older uarches like -mtune=pentium (in-order P5). I'd recommend -mregparm=3 for that so args are still passed in registers, not the stack.
