Are scaled-index addressing modes a good idea?


    Consider the following code:

    void foo(int* __restrict__ a)
    {
        int i; int val = 0;
        for (i = 0; i < 100; i++) {
            val = 2 * i;
            a[i] = val;
        }
    }
    

    This compiles (with maximum optimization but no unrolling or vectorization) into...

    GCC 7.2:

    foo(int*):
            xor     eax, eax
    .L2:
            mov     DWORD PTR [rdi], eax
            add     eax, 2
            add     rdi, 4
            cmp     eax, 200
            jne     .L2
            rep ret
    

    clang 5.0:

    foo(int*): # @foo(int*)
      xor eax, eax
    .LBB0_1: # =>This Inner Loop Header: Depth=1
      mov dword ptr [rdi + 2*rax], eax
      add rax, 2
      cmp rax, 200
      jne .LBB0_1
      ret
    

    What are the pros and cons of GCC's vs clang's approach? i.e. an extra variable incremented separately, vs multiplying via a more complex addressing mode?

    Notes:

    • This question also relates to this one with about the same code, but with floats rather than ints.

    Solution

    Yes, take advantage of the power of x86 addressing modes to save uops.


    Indexed addressing modes are generally cheap. At worst they cost one extra uop for the front-end (on Intel SnB-family CPUs in some situations), and/or prevent a store-address uop from using port7 (which only supports base + displacement addressing modes). See Agner Fog's microarch pdf, and also David Kanter's Haswell write-up, for more about the store-AGU on port7 which Intel added in Haswell.
    On Haswell+, if you need your loop to sustain more than 2 memory ops per clock, then avoid indexed stores.
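
    For illustration, a minimal sketch of that distinction (register choices are arbitrary, not from the original):

        ; which store-address forms can use the port-7 AGU on Haswell and later
        mov     dword ptr [rdi + 64], eax      ; base + disp: store-address uop can run on port 2, 3, or 7
        mov     dword ptr [rdi + 4*rcx], eax   ; base + scaled index: port 2 or 3 only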

    At best they're free other than the code-size cost of the extra byte in the machine-code encoding. (Having an index register requires a SIB (Scale Index Base) byte in the encoding).
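
    To make the code-size cost concrete, here are the encodings (shown for illustration; easy to check with any assembler):

        mov     eax, dword ptr [rdi]           ; 8B 07     (2 bytes: opcode + ModRM)
        mov     eax, dword ptr [rdi + 4*rcx]   ; 8B 04 8F  (3 bytes: opcode + ModRM + SIB)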

    More often the only penalty is the 1 extra cycle of load-use latency vs. a simple [base + 0-2047] addressing mode, on Intel Sandybridge-family CPUs.

    It's usually only worth using an extra instruction to avoid an indexed addressing mode if you're going to use that addressing mode in multiple instructions. (e.g. load / modify / store).
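
    A hypothetical sketch of that trade-off, for a read-modify-write like a[i] += 1 (register choices arbitrary):

        ; indexed mode repeated in both the load and the store:
        mov     eax, dword ptr [rdi + 4*rcx]
        add     eax, 1
        mov     dword ptr [rdi + 4*rcx], eax   ; indexed store: not port7-eligible

        ; vs. one extra LEA so both accesses use a cheap base-only mode:
        lea     rdx, [rdi + 4*rcx]
        mov     eax, dword ptr [rdx]
        add     eax, 1
        mov     dword ptr [rdx], eax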


    Scaling the index is free (on modern CPUs at least) if you're already using a 2-register addressing mode. For lea, Agner Fog's table lists AMD Ryzen as having 2c latency and only 2-per-clock throughput with scaled-index (or 3-component) addressing modes, otherwise 1c latency and 0.25c throughput. e.g. lea rax, [rcx + rdx] is faster than lea rax, [rcx + 2*rdx], but not by enough to be worth using extra instructions instead. Ryzen also doesn't like a 32-bit destination in 64-bit mode, for some reason. But the worst-case LEA is still not bad at all. And anyway, this is mostly unrelated to address-mode choice for loads, because most CPUs (other than in-order Atom) run LEA on the ALUs, not the AGUs used for actual loads/stores.

    The main question is between one-register unscaled (so it can be a "base" register in the machine-code encoding: [base + idx*scale + disp]) or two-register. Note that for Intel's micro-fusion limitations, [disp32 + idx*scale] (e.g. indexing a static array) is an indexed addressing mode.
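
    To make the classification concrete (table is a hypothetical static array label):

        mov     eax, dword ptr [rdi + 16]        ; one register, base + disp: a simple mode
        mov     eax, dword ptr [rdi + 4*rcx]     ; two registers: indexed
        mov     eax, dword ptr [table + 4*rcx]   ; disp32 + scaled index: no base register,
                                                 ; but still indexed for micro-fusion purposes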


    Neither function is totally optimal (even without considering unrolling or vectorization), but clang's looks very close.

    The only thing clang could do better here is save 2 bytes of code size by using add eax, 2 and cmp eax, 200, avoiding the REX.W prefixes. It promoted all the operands to 64-bit because it's using them with pointers, and I guess it proved that the C loop doesn't need them to wrap, so it uses 64-bit everywhere in the asm. This is pointless; 32-bit operations are always at least as fast as 64-bit, and implicit zero-extension is free. But it only costs 2 bytes of code size, and no performance other than indirect front-end effects from that.
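
    Concretely, a hypothetical 2-bytes-smaller version of the loop; since 32-bit operations implicitly zero-extend into the full 64-bit register, [rdi + 2*rax] still sees the right index:

        foo(int*):
          xor eax, eax                       ; also zeroes rax
        .LBB0_1:
          mov dword ptr [rdi + 2*rax], eax
          add eax, 2                         ; no REX.W prefix: 1 byte smaller
          cmp eax, 200                       ; likewise 1 byte smaller
          jne .LBB0_1
          ret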

    You've constructed your loop so the compiler needs to keep a specific value in registers and can't totally transform the problem into just a pointer-increment + compare against an end pointer (which compilers often do when they don't need the loop variable for anything except array indexing).
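
    For contrast, a hypothetical simpler loop (say, a[i] = 0) needs nothing but the address, so a compiler could plausibly emit the pointer-increment form (again assuming no unrolling or vectorization; register choices are mine):

        ; for (i = 0; i < 100; i++) a[i] = 0;
        lea     rsi, [rdi + 400]           ; end pointer: a + 100
        .Lzero:
        mov     dword ptr [rdi], 0
        add     rdi, 4
        cmp     rdi, rsi
        jne     .Lzero
        ret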

    You also can't transform the loop to count a negative index up towards zero (which compilers never do, but which would reduce the loop overhead to a total of 1 macro-fused add + branch uop on Intel CPUs; Intel can macro-fuse add + jcc, while AMD can only fuse test or cmp with jcc).
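
    The idiom would look something like this hypothetical sketch (the store value sits in an arbitrary register here; this particular loop would still need an instruction to maintain val):

        lea     rdi, [rdi + 400]               ; point one-past-the-end of a[100]
        mov     rax, -100                      ; index counts up towards zero
        .Lloop:
        mov     dword ptr [rdi + 4*rax], ecx   ; a[i], addressed via negative index from the end
        add     rax, 1
        jnz     .Lloop                         ; add + jnz macro-fuse on Intel: 1 uop of loop overhead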

    Clang has done a good job noticing that it can use 2*var as the array index (in bytes). This is a good optimization for tune=generic. The indexed store will un-laminate on Intel Sandybridge and Ivybridge, but stay micro-fused on Haswell and later. (And on other CPUs, like Nehalem, Silvermont, Ryzen, Jaguar, or whatever, there's no disadvantage.)

    gcc's loop has 1 extra uop in the loop. It can still in theory run at 1 store per clock on Core2 / Nehalem, but it's right up against the 4 uops per clock limit. (And actually, Core2 can't macro-fuse the cmp/jcc in 64-bit mode, so it bottlenecks on the front-end).
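
    A rough fused-domain uop tally per iteration (my count, for illustration, using Intel SnB-family rules):

        ; gcc:   mov [rdi], eax   1 uop  (store; base-only, port7-eligible on HSW+)
        ;        add eax, 2       1 uop
        ;        add rdi, 4       1 uop
        ;        cmp + jne        1 uop  (macro-fused)    -> 4 total
        ;
        ; clang: mov [rdi+2*rax]  1 uop  (un-laminates to 2 on SnB/IvB)
        ;        add rax, 2       1 uop
        ;        cmp + jne        1 uop  (macro-fused)    -> 3 total (4 on SnB/IvB)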
