What is the maximum possible IPC that can be achieved by the Intel Nehalem microarchitecture?

Views: 113 | Posted: 2020/10/11 0:12:19 | Tags: x86 intel cpu-architecture nehalem

Question

Is there an estimate of the maximum instructions per cycle (IPC) achievable by the Intel Nehalem architecture? Also, what is the bottleneck that limits the maximum instructions per cycle?

Recommended answer

TL:DR:

Intel Core, Nehalem, and Sandybridge / IvyBridge: a maximum of 5 IPC, including 1 macro-fused cmp+branch to get 5 instructions into 4 fused-domain uops, with the rest being single-uop instructions. (Up to 2 of these can be a micro-fused store or load+ALU.)

Haswell up to 9th generation: a maximum of 6 instructions per cycle can be achieved using two pairs of macro-fusable ALU+branch instructions and two instructions that each decode into two potentially micro-fused uops. The max unfused-domain uop throughput is 7 uops per clock, according to my testing on Skylake.

Early P6-family: Pentium Pro/PII/PIII, and Pentium M. Also Pentium 4: a maximum of 3 instructions per cycle can be achieved using 3 instructions that are decoded into 3 uops. (No macro-fusion, and 3-wide decode and issue).

The max IPC on Sunny Cove may be 7, thanks to increased front-end bandwidth of 5 uops per clock.
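
The TL:DR figures above follow one piece of arithmetic: each macro-fused pair packs two instructions into a single fused-domain issue slot, so the best case is the issue width plus the number of pairs that can fuse per clock. The sketch below only restates that arithmetic. The function name `max_ipc` is made up for illustration, and the two fused pairs assumed for Sunny Cove are a guess consistent with the "may be 7" above, not a measured value.

```python
def max_ipc(issue_width: int, fused_pairs_per_clock: int) -> int:
    """Best-case instructions per clock when every issue slot is filled.

    issue_width: fused-domain uops issued/renamed per clock.
    fused_pairs_per_clock: macro-fused compare+branch pairs per clock;
    each pair is 2 instructions occupying 1 issue slot.
    """
    if not 0 <= fused_pairs_per_clock <= issue_width:
        raise ValueError("fused pairs cannot exceed the issue width")
    # Every slot holds one instruction, plus one extra per fused pair.
    return issue_width + fused_pairs_per_clock


# (issue width, macro-fused pairs per clock) as quoted in the TL:DR.
# The Sunny Cove pair count is an assumption, see the note above.
TLDR = {
    "P6-family / Pentium 4": (3, 0),
    "Core / Nehalem / Sandybridge / IvyBridge": (4, 1),
    "Haswell up to 9th generation": (4, 2),
    "Sunny Cove (speculative)": (5, 2),
}

if __name__ == "__main__":
    for name, (width, pairs) in TLDR.items():
        print(f"{name}: {max_ipc(width, pairs)} IPC")
```

This is an upper bound only: it says nothing about whether the decoders or the execution ports can actually keep those slots full.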

Source: Agner Fog's microarch pdf and instruction tables. Also see the x86 tag wiki.

The out-of-order pipeline in Intel Core2 and later can issue/rename 4 fused-domain uops per clock. This is the bottleneck. Macro-fusion will combine a cmp / jcc into a single uop, but this can only happen once per decode block. (Until Haswell).

Also decode (up to 4 instructions into up-to-7 uops with a 4-1-1-1 pattern) is another important bottleneck before the uop-cache in SnB-family. Multi-uop instructions have to decode in the first "slot". See Agner Fog's microarch guide for much more about the potential bottlenecks in Nehalem.
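
To make the 4-1-1-1 rule concrete, here is a rough sketch that counts decode cycles for a sequence of instructions given only their uop counts. It is a simplified model, not a description of the real decoders: it assumes a greedy in-order grouping, ignores macro-fusion, instruction-length and fetch limits, and rejects instructions of more than 4 uops (which would go to the microcode sequencer). The name `decode_cycles` is hypothetical.

```python
def decode_cycles(uop_counts):
    """Count decode cycles under a simplified 4-1-1-1 decoder model.

    uop_counts: uops per instruction, in program order.
    Slot 0 accepts an instruction of 1 to 4 uops; slots 1 to 3 accept
    only single-uop instructions. A multi-uop instruction that is not
    first in the group waits for slot 0 of the next cycle.
    """
    for count in uop_counts:
        if not 1 <= count <= 4:
            raise ValueError("model only covers instructions of 1 to 4 uops")

    cycles = 0
    i = 0
    n = len(uop_counts)
    while i < n:
        i += 1  # slot 0 takes the next instruction, whatever its size
        for _ in range(3):  # slots 1..3 take single-uop instructions only
            if i < n and uop_counts[i] == 1:
                i += 1
            else:
                break
        cycles += 1
    return cycles


if __name__ == "__main__":
    print(decode_cycles([4, 1, 1, 1]))  # best case: 7 uops in one cycle
    print(decode_cycles([1, 2, 1, 2]))  # multi-uop instructions split the groups
```

In this model `[4, 1, 1, 1]` decodes in a single cycle, while `[1, 2, 1, 2]` needs three, because each 2-uop instruction has to wait for the first slot.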

Nehalem InstLatx64 shows that nop surprisingly only has 0.33c throughput, not 0.25, but it turns out according to https://www.uops.info/table.html that's because nop needs an ALU execution unit in CPUs before Sandybridge. Agner Fog says he didn't detect a retirement bottleneck on Nehalem.

Even if you could arrange things so more than one macro-fused pair per 4 uops was in a loop, Nehalem has a throughput of only one fused test-and-branch uop per clock (port 5). So it couldn't sustain more than one macro-fused compare-and-branch per clock even if some of them are not-taken. (Haswell can run not-taken branches on port 0 or port 6).

;; Should run at one iteration per clock
.l:
    mov   edx, [rsi]        ; doesn't need an ALU uop.  A store would work here, too, but a NOP needs an ALU port on Nehalem.
    add   eax, edx
    inc   rsi
    cmp   rsi, rdi          ; macro-fuses
    jb   .l                 ; with this, into 1 cmp+branch uop

For ease of testing, and to remove cache/memory bottlenecks, you could change it to load from the same location every time, instead of using the loop counter in the addressing mode. (As long as you avoid register-read stalls from too many cold registers.)

Note that pre-Haswell uarches only have three ALU ports. But mov loads or stores take pipeline bandwidth so there's a benefit to having 4-wide issue/rename. It's also useful for the front-end to be able to issue faster than the out-of-order core can execute, so there is always a buffer of work to do queued up in the scheduler, so it can find the instruction-level parallelism and get started on future loads early, and stuff like that.
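
The loop above can be checked against the limits quoted in this answer with a small throughput model: cycles per iteration is whichever resource is most oversubscribed. This is a back-of-the-envelope sketch, not a simulator. It ignores latencies, dependency chains, decode, and register-read stalls. The names `NEHALEM`, `cycles_per_iteration`, and `ipc` are made up for illustration, and the single load port is an assumption taken from general descriptions of Nehalem rather than from the text above.

```python
from fractions import Fraction

# Per-clock limits for Nehalem as quoted in the answer:
# 4 fused-domain uops issued, 3 ALU ports, 1 fused branch uop (port 5).
# load_ports = 1 is an assumption, not stated in the answer.
NEHALEM = {"issue_width": 4, "alu_ports": 3, "branch_ports": 1, "load_ports": 1}


def cycles_per_iteration(fused_uops, alu_uops, branch_uops, load_uops,
                         limits=NEHALEM):
    """Lower bound on cycles per loop iteration: the tightest bottleneck wins."""
    return max(
        Fraction(fused_uops, limits["issue_width"]),
        Fraction(alu_uops, limits["alu_ports"]),
        Fraction(branch_uops, limits["branch_ports"]),
        Fraction(load_uops, limits["load_ports"]),
    )


def ipc(instructions, **uops):
    """Instructions per clock for a loop body with the given uop mix."""
    return instructions / cycles_per_iteration(**uops)


if __name__ == "__main__":
    # mov load / add / inc / cmp+jb: 5 instructions, 4 fused-domain uops.
    # ALU uops are add, inc, and the fused cmp+jb (which also uses port 5).
    print(ipc(5, fused_uops=4, alu_uops=3, branch_uops=1, load_uops=1))
    # Same loop with the load replaced by a NOP, which needs an ALU port
    # on Nehalem: 4 ALU uops on 3 ports becomes the bottleneck.
    print(ipc(5, fused_uops=4, alu_uops=4, branch_uops=1, load_uops=0))
```

With the load in place every resource is exactly saturated at one iteration per clock, giving 5 IPC. Swapping the load for a NOP pushes the three ALU ports to 4/3 cycles per iteration in this model, which is why the loop uses a load or store for the fourth issue slot.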

I think other than load/store (including push/pop thanks to the stack engine), fxch might be the only fused-domain uop that doesn't need an ALU port in Nehalem. Or maybe it actually does, like nop. On SnB-family uarches, xor same,same is handled in the rename/issue stage, and sometimes also reg-reg movs (IvB and later). nop is also never executed, unlike on Nehalem, so SnB/IvB have 0.25c throughput for nop even though they only have 3 ALU ports.
