x86_64: is IMUL faster than 2x SHL + 2x ADD?

Question

When looking at the assembly produced by Visual Studio (2015U2) in /O2 (release) mode I saw that this 'hand-optimized' piece of C code is translated back into a multiplication:

int64_t calc(int64_t a) {
  return (a << 6) + (a << 16) - a;
}

The assembly:

  imul        rdx,qword ptr [a],1003Fh  

So I was wondering if that is really faster than doing it the way it is written, something like:

  mov         rbx,qword ptr [a]  
  mov         rax,rbx  
  shl         rax,6  
  mov         rcx,rbx  
  shl         rcx,10h  
  add         rax,rcx  
  sub         rax,rbx  

I was always under the impression that multiplication is always slower than a few shifts/adds? Is that no longer the case with modern Intel x86_64 processors?

Answer

That's right, modern x86 CPUs (especially Intel) have very high performance multipliers.
imul r, r/m and imul r, r/m, imm are both 3-cycle latency, one per 1c throughput on Intel SnB-family and AMD Ryzen, even for 64-bit operand size.

On AMD Bulldozer-family, it's 4c or 6c latency, and one per 2c or one per 4c throughput. (The slower times for 64-bit operand-size).

Data from Agner Fog's instruction tables. See also other stuff in the x86 tag wiki.

The transistor budget in modern CPUs is pretty huge, and allows for the amount of hardware parallelism needed to do a 64-bit multiply with such low latency. (It takes a lot of adders to make a large fast multiplier: How modern X86 processors actually compute multiplications?).

Being limited by power budget, not transistor budget, means that having dedicated hardware for many different functions is possible, as long as they can't all be switching at the same time (https://en.wikipedia.org/wiki/Dark_silicon). e.g. you can't saturate the pext/pdep unit, the integer multiplier, and the vector FMA units all at once on an Intel CPU, because many of them are on the same execution ports.

Fun fact: imul r64 is also 3c, so you can get a full 64*64 => 128b multiply result in 3 cycles. imul r32 is 4c latency and an extra uop, though. My guess is that the extra uop / cycle is splitting the 64-bit result from the regular 64-bit multiplier into two 32-bit halves.

Compilers typically optimize for latency, and typically don't know how to optimize short independent dependency chains for throughput vs. long loop-carried dependency chains that bottleneck on latency.

gcc and clang3.8 and later use up to two LEA instructions instead of imul r, r/m, imm. I think gcc will use imul if the alternative is 3 or more instructions (not including mov), though.

That's a reasonable tuning choice, since a 3 instruction dep chain would be the same length as an imul on Intel. Using two 1-cycle instructions spends an extra uop to shorten the latency by 1 cycle.

clang3.7 and earlier tends to favour imul except for multipliers that only require a single LEA or shift. So clang very recently changed to optimizing for latency instead of throughput for multiplies by small constants. (Or maybe for other reasons, like not competing with other things that are only on the same port as the multiplier.)

e.g. this code on the Godbolt compiler explorer:

int foo (int a) { return a * 63; }
    # gcc 6.1 -O3 -march=haswell (and clang actually does the same here)
    mov     eax, edi  # tmp91, a
    sal     eax, 6    # tmp91,
    sub     eax, edi  # tmp92, a
    ret

clang3.8 and later make the same code.
