What is the maximum possible IPC that can be achieved by the Intel Nehalem microarchitecture?


Question


Is there an estimate of the maximum instructions per cycle (IPC) achievable by the Intel Nehalem architecture? Also, what is the bottleneck that limits the maximum instructions per cycle?

Recommended answer

TL;DR


Intel Core, Nehalem, and Sandybridge / IvyBridge: a maximum of 5 IPC, including 1 macro-fused cmp+branch to get 5 instructions into 4 fused-domain uops, with the rest being single-uop instructions. (Up to 2 of these can be micro-fused store or load+ALU.)


Haswell up to 9th generation: a maximum of 6 instructions per cycle can be achieved using two pairs of macro-fusable ALU+branch instructions plus two instructions that each decode to a single, potentially micro-fused, uop. The max unfused-domain uop throughput is 7 uops per clock, according to my testing on Skylake.


Early P6-family (Pentium Pro/PII/PIII and Pentium M), and also Pentium 4: a maximum of 3 instructions per cycle can be achieved using 3 instructions that decode to 3 uops. (No macro-fusion, and 3-wide decode and issue.)


The max IPC on Sunny Cove may be 7, thanks to increased front-end bandwidth of 5 uops per clock.


Source: Agner Fog's microarch pdf and instruction tables. Also see the x86 tag wiki.


The out-of-order pipeline in Intel Core2 and later can issue/rename 4 fused-domain uops per clock. This is the bottleneck. Macro-fusion will combine a cmp / jcc into a single uop, but this can only happen once per decode block. (Until Haswell).
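The TL;DR figures follow from simple arithmetic: each issue slot carries one fused-domain uop, and a macro-fused pair packs two instructions into one of those slots. Below is a minimal sketch of that arithmetic in Python. The function name `max_ipc` and the table of (issue width, fused pairs per clock) values are illustrative choices that restate the numbers quoted in this answer; they are not measurements.

```python
def max_ipc(issue_width: int, fused_pairs_per_cycle: int) -> int:
    """Upper bound on instructions per cycle.

    Assumes every issue slot holds a single-uop instruction, and that
    `fused_pairs_per_cycle` of those slots instead hold a macro-fused
    pair (two instructions in one fused-domain uop).
    """
    if not 0 <= fused_pairs_per_cycle <= issue_width:
        raise ValueError("fused pairs cannot exceed the issue width")
    return issue_width + fused_pairs_per_cycle


# (issue width in fused-domain uops, macro-fused pairs per clock),
# restating the figures given in the answer text.
FAMILIES = {
    "P6 family / Pentium 4": (3, 0),
    "Core / Nehalem / Sandybridge / IvyBridge": (4, 1),
    "Haswell up to 9th generation": (4, 2),
    "Sunny Cove (tentative)": (5, 2),
}

for name, (width, pairs) in FAMILIES.items():
    print(f"{name}: max IPC = {max_ipc(width, pairs)}")
```

This is only a front-end ceiling; it says nothing about whether the execution ports can keep up, which is a separate limit on Nehalem.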


Decode is another important bottleneck on CPUs before SnB-family, which introduced the uop cache. The decoders handle up to 4 instructions per clock, producing up to 7 uops in a 4-1-1-1 pattern, and multi-uop instructions have to decode in the first "slot". See Agner Fog's microarch guide for much more about the potential bottlenecks in Nehalem.
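To see how the 4-1-1-1 pattern limits decode throughput, here is a rough Python sketch that counts decode cycles for a sequence of instructions given as uop counts. It is a simplified model under stated assumptions: instructions decode strictly in program order, the first slot accepts any instruction of up to 4 uops, the other three slots accept only single-uop instructions, and macro-fusion, instruction-length limits, and microcoded instructions are ignored. The name `decode_cycles` is hypothetical.

```python
def decode_cycles(uops_per_insn, simple_decoders=3, complex_max_uops=4):
    """Count cycles needed to decode instructions with a 4-1-1-1 pattern.

    `uops_per_insn` lists the uop count of each instruction in program
    order. Simplified model: no macro-fusion, no fetch/length limits.
    """
    cycles = 0
    i = 0
    n = len(uops_per_insn)
    while i < n:
        if not 1 <= uops_per_insn[i] <= complex_max_uops:
            raise ValueError("uop count outside this model (microcoded?)")
        i += 1  # slot 0 takes the next instruction, single- or multi-uop
        taken = 0
        # The remaining slots only accept single-uop instructions; a
        # multi-uop instruction ends the group and waits for slot 0.
        while i < n and taken < simple_decoders and uops_per_insn[i] == 1:
            i += 1
            taken += 1
        cycles += 1
    return cycles


print(decode_cycles([4, 1, 1, 1]))  # best case: 7 uops in one group
print(decode_cycles([2, 2, 2, 2]))  # every instruction needs slot 0
```

In this model a run of 2-uop instructions decodes at one instruction per clock, which is why instruction ordering mattered so much before the uop cache.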


The InstLatx64 results for Nehalem show that nop surprisingly has only 0.33c throughput, not 0.25c. According to https://www.uops.info/table.html, that's because nop needs an ALU execution unit on CPUs before Sandybridge. Agner Fog says he didn't detect a retirement bottleneck on Nehalem.


Even if you could arrange things so more than one macro-fused pair per 4 uops was in a loop, Nehalem has a throughput of only one fused test-and-branch uop per clock (port 5). So it couldn't sustain more than one macro-fused compare-and-branch per clock even if some of them are not-taken. (Haswell can run not-taken branches on port 0 or port 6).

;; Should run at one iteration per clock
.l:
    mov   edx, [rsi]    ; doesn't need an ALU uop.  A store would work here, too, but a NOP needs an ALU port on Nehalem.
    add   eax, edx
    inc   rsi
    cmp   rsi, rdi          ; macro-fuses
    jb   .l                 ; with this, into 1 cmp+branch uop


For ease of testing, and to remove cache/memory bottlenecks, you could change it to load from the same location every time, instead of using the loop counter in the addressing mode. (As long as you avoid register-read stalls from too many cold registers.)


Note that pre-Haswell uarches have only three ALU ports. But mov loads and stores also take pipeline bandwidth, so there's still a benefit to having 4-wide issue/rename. It's also useful for the front-end to be able to issue faster than the out-of-order core can execute, so that the scheduler always has a buffer of queued work in which it can find instruction-level parallelism and get started early on future loads.
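The interaction between issue width and port counts can be sketched as a simple throughput bound: a loop iteration cannot take fewer cycles than its most contended resource requires. The Python below is a rough model, not a simulator. The three ALU ports and the one-branch-per-clock limit come from this answer; the single load port is an assumption about Nehalem stated here for illustration, and the function name `cycles_per_iteration` is hypothetical. Scheduling conflicts, latency, and decode limits are ignored.

```python
from fractions import Fraction


def cycles_per_iteration(fused_uops, alu_uops, loads, branches,
                         issue_width=4, alu_ports=3,
                         load_ports=1, branch_ports=1):
    """Lower bound on cycles per loop iteration from resource counts.

    `alu_uops` includes the fused cmp+branch uop, because it occupies
    port 5, which is one of the three ALU ports in this model.
    """
    return max(
        Fraction(fused_uops, issue_width),  # issue/rename bandwidth
        Fraction(alu_uops, alu_ports),      # ALU execution ports
        Fraction(loads, load_ports),        # load port (assumed: one)
        Fraction(branches, branch_ports),   # one branch per clock
    )


# The 5-instruction loop from this answer: load, add, inc, fused cmp+jb.
loop = cycles_per_iteration(fused_uops=4, alu_uops=3, loads=1, branches=1)
print("load loop:", loop, "cycles/iter, IPC", Fraction(5) / loop)

# Same loop with a nop in place of the load: on Nehalem the nop needs
# an ALU port, so four uops compete for three ALU ports.
nop_loop = cycles_per_iteration(fused_uops=4, alu_uops=4, loads=0, branches=1)
print("nop loop:", nop_loop, "cycles/iter, IPC", Fraction(5) / nop_loop)
```

Under these assumptions the load version is bounded by issue width at one iteration per clock, while the nop version is bounded by the ALU ports, which matches the point that the fourth issue slot only helps when it is filled by a uop that needs no ALU port.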


I think other than load/store (including push/pop thanks to the stack engine), fxch might be the only fused-domain uop that doesn't need an ALU port in Nehalem. Or maybe it actually does, like nop. On SnB-family uarches, xor same,same is handled in the rename/issue stage, and sometimes so are reg-reg movs (IvB and later). nop is also never executed there, unlike on Nehalem, so SnB/IvB have 0.25c throughput for nop even though they have only 3 ALU ports.
