x86_64: is IMUL faster than 2x SHL + 2x ADD?

Question

When looking at the assembly produced by Visual Studio (2015U2) in /O2 (release) mode I saw that this 'hand-optimized' piece of C code is translated back into a multiplication:

int64_t calc(int64_t a) {
  return (a << 6) + (a << 16) - a;
}

Assembly:

  imul        rdx,qword ptr [a],1003Fh  

So I was wondering if that is really faster than doing it the way it is written, something like:

  mov         rbx,qword ptr [a]  
  mov         rax,rbx  
  shl         rax,6  
  mov         rcx,rbx  
  shl         rcx,10h  
  add         rax,rcx  
  sub         rax,rbx  

I was always under the impression that multiplication is always slower than a few shifts/adds? Is that no longer the case with modern Intel x86_64 processors?

Answer

That's right, modern x86 CPUs (especially Intel) have very high performance multipliers.
imul r, r/m is 3c latency, one per 1c throughput on Intel SnB-family, even for 64bit operand size.

On AMD Bulldozer-family, it's 4c or 6c latency, and one per 2c or one per 4c throughput. (Slower times for 64bit operand-size).

Data from Agner Fog's instruction tables. See also other stuff in the x86 tag wiki.
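
If you want to sanity-check those numbers on your own machine, here is a minimal latency micro-benchmark sketch of my own (not from the answer; it assumes a POSIX clock_gettime, and that the compiler doesn't rewrite one loop body into the other, so inspect the generated asm). Each iteration feeds the next, so the loop runs at the latency of the multiply (or of the shift/add chain), not its throughput:

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Dependent chain: each iteration's input is the previous result,
 * so elapsed time ~ latency * iterations. */
static uint64_t chain_mul(uint64_t a, long iters) {
    for (long i = 0; i < iters; i++)
        a = a * 0x1003F;                 /* usually compiles to a single imul */
    return a;
}

static uint64_t chain_shift(uint64_t a, long iters) {
    for (long i = 0; i < iters; i++)
        a = (a << 6) + (a << 16) - a;    /* the compiler may fold this back into imul */
    return a;
}

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    const long N = 200000000;            /* 2e8 dependent iterations */
    double t0 = seconds();
    uint64_t r1 = chain_mul(1, N);
    double t1 = seconds();
    uint64_t r2 = chain_shift(1, N);
    double t2 = seconds();
    /* print the results so the chains aren't optimized away */
    printf("imul chain : %.3fs (%llu)\n", t1 - t0, (unsigned long long)r1);
    printf("shift chain: %.3fs (%llu)\n", t2 - t1, (unsigned long long)r2);
    return 0;
}

At roughly 3c per dependent multiply, 2e8 iterations should take on the order of 0.15-0.2s at typical clock speeds; if both loops come out the same, check whether the compiler turned the shift version back into an imul.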

The transistor budget in modern CPUs is pretty huge, and allows for the amount of hardware parallelism needed to do a 64 bit multiply with such low latency. (It takes a lot of adders to make a large fast multiplier).

Being limited by power budget, not transistor budget, means that having dedicated hardware for many different functions is possible, as long as they can't all be switching at the same time. (e.g. you can't saturate the pext/pdep unit, the integer multiplier, and the vector FMA units all at once, because many of them are on the same execution ports).

Fun fact: imul r64 is also 3c, so you can get a full 64*64 => 128b multiply result in 3 cycles. imul r32 is 4c latency and an extra uop, though. My guess is that the extra uop / cycle is splitting the 64bit result from the regular 64bit multiplier into two 32bit halves.
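
As a side note (my own sketch, using the GCC/Clang __int128 extension, not something from the answer), the full 64*64 => 128b product is reachable from plain C and typically compiles to that one-operand imul:

#include <stdint.h>

/* Full 64x64 => 128-bit signed product; __int128 is a GCC/Clang extension. */
__int128 full_mul(int64_t a, int64_t b) {
    return (__int128)a * (__int128)b;
}
    # gcc/clang -O2 typically emit something like:
    mov     rax, rdi
    imul    rsi              # RDX:RAX = RAX * RSI
    ret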

gcc still uses up to two LEA instructions instead of imul r, r/m, imm, but clang tends to favour imul. I think gcc will use imul if the alternative is 3 or more instructions (not including mov), though.

That's a reasonable tuning choice, since a 3 instruction dep chain would be the same length as an imul on Intel. Using two 1-cycle instructions spends an extra uop to shorten the latency by 1 cycle.

e.g. this code on the Godbolt compiler explorer (http://gcc.godbolt.org/#compilers:!((compiler:g6,options:'-xc+-std%3Dgnu11+-Wall+-Wextra+-fverbose-asm+-O3+-march%3Dhaswell',source:'int+foo+(int+a)+%7B+return+a+*+63%3B+%7D')),filterAsm:(commentOnly:!t,directives:!t,intel:!t,labels:!t),version:3):

int foo (int a) { return a * 63; }
    # gcc 6.1 -O3 -march=haswell (and clang actually does the same here)
    mov     eax, edi  # tmp91, a
    sal     eax, 6    # tmp91,
    sub     eax, edi  # tmp92, a
    ret
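
For contrast, a multiplier that factors into two LEA-friendly constants (a hypothetical example I added, so the exact output may vary by gcc version) tends to come out as the two-LEA form instead of an imul:

int bar (int a) { return a * 45; }   /* 45 = 9 * 5 */
    # gcc -O2, typical output
    lea     eax, [rdi+rdi*8]   # a*9
    lea     eax, [rax+rax*4]   # (a*9)*5 = a*45
    ret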
