x86_64 - Assembly - loop conditions and out of order

Question

(If that were the case, I would have done it myself.)

My question:

I tend to avoid the indirect/indexed addressing modes for convenience.

As a replacement, I often use immediate, absolute, or register addressing.

Code:

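# %esi holds the array address. Say we iterate over a doubleword (4-byte) array.
# %ecx holds the array element count.
myloop:                 # located at address 0x98767 in the original listing
    ...                 # do whatever with %esi
    add $4, %esi        # advance to the next doubleword
    dec %ecx            # one fewer element left
    jnz myloop          # originally written as an absolute jump to 0x98767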

Here, we have a serialized combo (dec and jnz), which prevents proper out-of-order execution (dependency).

Is there a way to avoid that / break the dependency? (I am not an assembly expert.)

Recommended answer

When optimizing for Intel CPUs, always put the flag-setting instruction right before the conditional jump instruction (if it's one of the simple ones listed in the table below), so they can macro-fuse into one uop in the decoders.

Doing this is not significantly worse for older CPUs that don't do macro-fusion. Setting the flags earlier might shorten the branch mispredict penalty by one cycle on such CPUs, but out-of-order execution means that moving the dec a couple of instructions earlier won't make a real difference. See also "Avoid stalling pipeline by calculating conditional early".

To really make a difference, you do things like unroll the loop and/or branch on something that can be calculated more simply, ideally without a dependency on a slow input, so that OoO exec can have the branch already resolved while it is still working on older iterations of the loop body, i.e. the loop-counter dep chain can run ahead of the main work.
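As a rough sketch of this advice (not taken from the answer), the hypothetical routine sum_dwords below keeps dec immediately before jnz so that the pair is eligible for macro-fusion. It is written in GNU as AT&T syntax for the x86-64 System V calling convention. The function name, the choice of summing the array as the "work", and the register assignment are assumptions made for illustration, and the loop assumes count is at least 1.

```asm
# Hypothetical example, callable from C as:
#   int32_t sum_dwords(const int32_t *arr, uint64_t count);
# Arguments arrive in %rdi (arr) and %rsi (count). Assumes count >= 1.
    .text
    .globl  sum_dwords
sum_dwords:
    xor     %eax, %eax      # running sum = 0
1:
    add     (%rdi), %eax    # main work: add the current element to the sum
    add     $4, %rdi        # advance to the next doubleword
    dec     %rsi            # flag-setting instruction ...
    jnz     1b              # ... directly before the branch (dec + jnz can fuse)
    ret
```

Note that dec %rsi depends only on the previous value of %rsi, not on the loads, so the loop-counter chain is free to run ahead of the main work under out-of-order execution.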

I don't have benchmarks, but I don't think the small downside on increasingly rare CPUs justifies missing out on the front-end throughput benefit (decode and issue) on CPUs that do fusion. Total uop throughput can often be a bottleneck.

AMD Bulldozer/Piledriver/Steamroller can fuse test/cmp with any jcc, but only test/cmp, not any other ALU instructions. So definitely put compares right before branches. On Intel CPUs it's still valuable to put other flag-setting instructions right before branches if they can macro-fuse on Sandybridge-family.
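One way to get a cmp + jcc pair at the bottom of a loop is to compare the moving pointer against a precomputed end pointer instead of counting down with dec. The sketch below is only an illustration under assumptions: the routine name sum_dwords_cmp, the summing workload, and the register choices are hypothetical, the syntax is GNU as AT&T for the x86-64 System V calling convention, and count is assumed to be at least 1.

```asm
# Hypothetical example, callable from C as:
#   int32_t sum_dwords_cmp(const int32_t *arr, uint64_t count);
# Arguments arrive in %rdi (arr) and %rsi (count). Assumes count >= 1.
    .text
    .globl  sum_dwords_cmp
sum_dwords_cmp:
    lea     (%rdi,%rsi,4), %rdx   # end pointer = arr + 4 * count
    xor     %eax, %eax            # running sum = 0
1:
    add     (%rdi), %eax          # main work: add the current element to the sum
    add     $4, %rdi              # advance to the next doubleword
    cmp     %rdx, %rdi            # flags from (current pointer - end pointer)
    jne     1b                    # loop until the pointer reaches the end
    ret
```

The loop body has the same instruction count as a dec/jnz version, but the branch is now preceded by a cmp, which is the only kind of pair (together with test) that the AMD cores named above can fuse.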

From Agner Fog's microarch guide, Table 9.2 (for Sandybridge / Ivybridge):

First             | Can pair with these    | Cannot pair with
instruction       | (and the inverse)      |
----------------------------------------------------------------------
cmp               | jz, jc, jb, ja, jl, jg | js, jp, jo
add, sub          | jz, jc, jb, ja, jl, jg | js, jp, jo
adc, sbb          | none                   |
inc, dec          | jz, jl, jg             | jc, jb, ja, js, jp, jo
test              | all                    |
and               | all                    |
or, xor, not, neg | none                   |
shift, rotate     | none                   |

Table 9.2. Instruction fusion

So basically, inc/dec can macro-fuse with a jcc as long as the condition depends only on flag bits that inc/dec modify. inc and dec leave CF unchanged, which is why the CF-dependent jc, jb, and ja are in the "cannot pair" column.

(Otherwise, they don't macro-fuse, and you get an extra uop inserted to merge the flags (like when you read eax after writing al). Or, on earlier CPUs, a partial-flags stall.)
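When a loop condition genuinely needs the carry flag, one option is to count with sub $1 instead of dec: sub writes all the flags, and per Table 9.2 it can pair with jc and its inverse jnc. The sketch below is an assumption-laden illustration, not something from the answer. The routine name sum_dwords_rev, the backward summing workload, and the register choices are hypothetical, and the syntax is GNU as AT&T for the x86-64 System V calling convention.

```asm
# Hypothetical example, callable from C as:
#   int32_t sum_dwords_rev(const int32_t *arr, uint64_t count);
# Arguments arrive in %rdi (arr) and %rsi (count). A count of 0 returns 0.
    .text
    .globl  sum_dwords_rev
sum_dwords_rev:
    xor     %eax, %eax            # running sum = 0
    sub     $1, %rsi              # index = count - 1; CF is set if count was 0
    jc      2f                    # empty array: nothing to add
1:
    add     (%rdi,%rsi,4), %eax   # main work: add arr[index] to the sum
    sub     $1, %rsi              # index--, writes every flag including CF
    jnc     1b                    # keep looping until index borrows below 0
2:
    ret
```

Using dec %rsi here would not just miss fusion: since dec leaves CF alone, jnc would test the carry left behind by the preceding add rather than the result of the decrement.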

Core2 / Nehalem was more limited in macro-fusion capability (just CMP/TEST, with more limited JCC combinations), and Core2 couldn't macro-fuse in 64-bit mode at all.

Read Agner Fog's optimizing asm and C guides, too, if you haven't already. They're full of essential knowledge.
