x86_64: is IMUL faster than 2x SHL + 2x ADD?
Question
When looking at the assembly produced by Visual Studio (2015U2) in /O2 (release) mode, I saw that this 'hand-optimized' piece of C code is translated back into a multiplication:
int64_t calc(int64_t a) {
return (a << 6) + (a << 16) - a;
}
Assembly:
imul rdx,qword ptr [a],1003Fh
So I was wondering if that is really faster than doing it the way it is written, something like:
mov rbx,qword ptr [a]
mov rax,rbx
shl rax,6
mov rcx,rbx
shl rcx,10h
add rax,rcx
sub rax,rbx
I was always under the impression that multiplication is always slower than a few shifts/adds? Is that no longer the case with modern Intel x86_64 processors?
Answer
That's right, modern x86 CPUs (especially Intel) have very high performance multipliers. imul r, r/m is 3c latency, one per 1c throughput on Intel SnB-family, even for 64-bit operand size.
On AMD Bulldozer-family, it's 4c or 6c latency, and one per 2c or one per 4c throughput. (Slower times for 64bit operand-size).
Data from Agner Fog's instruction tables. See also other stuff in the x86 tag wiki.
The transistor budget in modern CPUs is pretty huge, and allows for the amount of hardware parallelism needed to do a 64 bit multiply with such low latency. (It takes a lot of adders to make a large fast multiplier).
Being limited by power budget, not transistor budget, means that having dedicated hardware for many different functions is possible, as long as they can't all be switching at the same time. (e.g. you can't saturate the pext/pdep unit, the integer multiplier, and the vector FMA units all at once, because many of them are on the same execution ports).
Fun fact: imul r64 is also 3c, so you can get a full 64*64 => 128b multiply result in 3 cycles. imul r32 is 4c latency and an extra uop, though. My guess is that the extra uop / cycle is splitting the 64-bit result from the regular 64-bit multiplier into two 32-bit halves.
gcc still uses up to two LEA instructions instead of imul r, r/m, imm, but clang tends to favour imul. I think gcc will use imul if the alternative is 3 or more instructions (not including mov), though.
That's a reasonable tuning choice, since a 3-instruction dep chain would be the same length as an imul on Intel. Using two 1-cycle instructions spends an extra uop to shorten the latency by 1 cycle.
e.g. this code on the Godbolt compiler explorer (http://gcc.godbolt.org/#compilers:!((compiler:g6,options:'-xc+-std%3Dgnu11+-Wall+-Wextra+-fverbose-asm+-O3+-march%3Dhaswell',source:'int+foo+(int+a)+%7B+return+a+*+63%3B+%7D')),filterAsm:(commentOnly:!t,directives:!t,intel:!t,labels:!t),version:3):
int foo (int a) { return a * 63; }
# gcc 6.1 -O3 -march=haswell (and clang actually does the same here)
mov eax, edi # tmp91, a
sal eax, 6 # tmp91,
sub eax, edi # tmp92, a
ret