x86_64 - Assembly - loop conditions and out of order

Published: 2016/7/18 19:51:18. Tags: loops, assembly, condition, x86-64

Question

(If that were the case, I would do it myself.)

My question:

I tend to avoid the indirect/indexed addressing modes for convenience.

As a replacement, I often use immediate, absolute or register addressing.

In code:

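# %esi holds the array address. Say we iterate over a doubleword (4-byte) array.
# %ecx holds the array element count.
myloop:                 # located at address 0x98767
    ...                 # do whatever with %esi
    add $4, %esi        # advance to the next doubleword
    dec %ecx            # one element fewer to process
    jnz myloop          # i.e. jnz 0x98767: loop until %ecx reaches zero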

Here, we have a serialized combo (dec and jnz) which prevents proper out-of-order execution (dependency).

Is there a way to avoid that / break the dep? (I am not an assembly expert.)

Recommended answer

When optimizing for Intel CPUs, always put the flag-setting instruction right before the conditional jump instruction, so they can macro-fuse into one uop in the decoders. It's not significantly worse for AMD CPUs. On CPUs that don't macro-fuse compare-and-branch pairs, separating them might shorten the branch mispredict penalty by one cycle. In the correctly-predicted case, it's one extra op taking space in the re-order buffer for one extra cycle. I don't think either of those justifies splitting up compare-and-branch instructions. I think the extra uop throughput on Intel is more significant.

AMD Bulldozer/Piledriver/Steamroller can fuse test/cmp with any jcc, but only test/cmp, not any ALU instructions. So definitely put compares with branches.

From Agner Fog's microarch guide, Table 9.2 (for Sandybridge / Ivybridge):

First instruction  | Can pair with these (and the inverse) | Cannot pair with
-------------------+---------------------------------------+-----------------------
cmp                | jz, jc, jb, ja, jl, jg                | js, jp, jo
add, sub           | jz, jc, jb, ja, jl, jg                | js, jp, jo
adc, sbb           | none                                  |
inc, dec           | jz, jl, jg                            | jc, jb, ja, js, jp, jo
test               | all                                   |
and                | all                                   |
or, xor, not, neg  | none                                  |
shift, rotate      | none                                  |

Table 9.2. Instruction fusion

So basically, inc/dec can macro-fuse with a jcc as long as the condition only depends on bits that are modified by inc/dec. (Otherwise, they don't macro-fuse, and you get an extra uop inserted to merge the flags (like when you read eax after writing al). Or on earlier CPUs, a partial-flags stall.)
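
As a rough illustration of how Table 9.2 reads, here is a minimal C sketch that encodes the table as a lookup. It only covers the Sandybridge / Ivybridge rules quoted above and ignores every other requirement (such as the two instructions being adjacent). The names `fusion_rule`, `jcc_group` and `can_macro_fuse` are made up for this example, and the alias list covers only some common jcc spellings; an unrecognized spelling is reported as not fusible.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* One row of Table 9.2: a flag-setting instruction and the jcc groups it pairs with. */
struct fusion_rule {
    const char *first;    /* flag-setting instruction mnemonic */
    const char *fusible;  /* space-separated jcc groups, spelled as in the table */
};

static const struct fusion_rule RULES[] = {
    {"cmp",  "jz jc jb ja jl jg"},
    {"add",  "jz jc jb ja jl jg"},
    {"sub",  "jz jc jb ja jl jg"},
    {"inc",  "jz jl jg"},
    {"dec",  "jz jl jg"},
    {"test", "jz jc jb ja jl jg js jp jo"},  /* "all" in the table */
    {"and",  "jz jc jb ja jl jg js jp jo"},  /* "all" in the table */
    /* adc, sbb, or, xor, not, neg, shifts and rotates pair with none: no rows. */
};

/* Maps a jcc spelling to the table's group name; the table says each group
 * also covers its inverse. Illustrative subset of spellings only. */
static const char *const ALIASES[][2] = {
    {"jz", "jz"}, {"jnz", "jz"}, {"je", "jz"}, {"jne", "jz"},
    {"jc", "jc"}, {"jnc", "jc"},
    {"jb", "jb"}, {"jnb", "jb"}, {"jae", "jb"},
    {"ja", "ja"}, {"jna", "ja"}, {"jbe", "ja"},
    {"jl", "jl"}, {"jnl", "jl"}, {"jge", "jl"},
    {"jg", "jg"}, {"jng", "jg"}, {"jle", "jg"},
    {"js", "js"}, {"jns", "js"},
    {"jp", "jp"}, {"jnp", "jp"},
    {"jo", "jo"}, {"jno", "jo"},
};

/* Returns the table's group name for a jcc spelling, or NULL if unrecognized. */
static const char *jcc_group(const char *jcc) {
    for (size_t i = 0; i < sizeof ALIASES / sizeof ALIASES[0]; i++)
        if (strcmp(ALIASES[i][0], jcc) == 0)
            return ALIASES[i][1];
    return NULL;
}

/* Returns true if `word` appears as a whole token in the space-separated `list`. */
static bool word_in_list(const char *word, const char *list) {
    size_t n = strlen(word);
    const char *p = list;
    while (*p) {
        const char *end = strchr(p, ' ');
        size_t len = end ? (size_t)(end - p) : strlen(p);
        if (len == n && strncmp(p, word, n) == 0)
            return true;
        if (!end)
            break;
        p = end + 1;
    }
    return false;
}

/* True if Table 9.2 lists `first` as able to pair with the given jcc. */
bool can_macro_fuse(const char *first, const char *jcc) {
    const char *group = jcc_group(jcc);
    if (group == NULL)
        return false;
    for (size_t i = 0; i < sizeof RULES / sizeof RULES[0]; i++)
        if (strcmp(RULES[i].first, first) == 0)
            return word_in_list(group, RULES[i].fusible);
    return false;  /* instruction has no row, i.e. "none" */
}
```

Under this sketch, `can_macro_fuse("dec", "jnz")` is true, which matches the dec/jnz pair from the question, while `can_macro_fuse("dec", "jc")` is false because the table lists jc under "cannot pair with" for inc/dec.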

Core2 / Nehalem were more limited in macro-fusion capability, and Core2 couldn't macro-fuse in 64-bit mode at all.

Read Agner Fog's optimizing asm and C guides, too, if you haven't already. They're full of essential knowledge.
