What is the stack engine in the Sandybridge microarchitecture?


Question

I am reading http://www.realworldtech.com/sandy-bridge/ and I'm having trouble understanding a couple of passages:

The dedicated stack pointer tracker is also present in Sandy Bridge and renames the stack pointer, eliminating serial dependencies and removing a number of uops.

What exactly is the dedicated stack pointer tracker?

For Sandy Bridge (and the P4), Intel still uses the term ROB. But it is critical to understand that, in this context, it only refers the status array for in-flight uops

What does this actually mean? Please explain.

Answer

  1. As Agner Fog's microarch doc explains, the stack engine handles the rsp+=8 / rsp-=8 part of push/pop and call/ret in the issue stage of the pipeline (before issuing uops into the Out-of-Order (OoO) part of the core).

So the OoO execution part of the core only has to handle the load/store part, using an address generated by the stack engine. The stack engine occasionally has to insert a uop to sync its offset from rsp: when the 8-bit displacement counter overflows, or when the OoO core needs the value of rsp directly. For example, sub rsp, 8 or mov [rsp-8], eax after a call, ret, push or pop typically causes an extra uop to be inserted on Intel CPUs. AMD CPUs apparently don't need extra sync uops.
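The bookkeeping described above can be sketched as a toy model. This is only an illustration, not Intel's actual design: the class name `StackEngine`, the uop strings, and the assumed signed 8-bit offset range are all hypothetical, and a push is simplified to a single store uop.

```python
class StackEngine:
    """Toy model of an issue-stage stack pointer tracker (illustrative only)."""

    # Assumption: the tracked offset is a signed 8-bit value.
    DELTA_MIN, DELTA_MAX = -128, 127

    def __init__(self):
        self.delta = 0   # offset between the true rsp and the rsp the OoO core holds
        self.uops = []   # uops issued into the out-of-order core

    def _sync(self):
        # Insert one extra uop that folds the pending offset into rsp.
        if self.delta != 0:
            self.uops.append(f"sync: rsp += {self.delta}")
            self.delta = 0

    def _adjust(self, amount):
        # If the offset would overflow its range, sync first.
        if not (self.DELTA_MIN <= self.delta + amount <= self.DELTA_MAX):
            self._sync()
        self.delta += amount

    def push(self, reg):
        self._adjust(-8)  # rsp -= 8 handled here, no ALU uop issued
        self.uops.append(f"store [rsp{self.delta:+d}] <- {reg}")

    def pop(self, reg):
        self.uops.append(f"load {reg} <- [rsp{self.delta:+d}]")
        self._adjust(+8)  # rsp += 8 handled here, no ALU uop issued

    def explicit_rsp_use(self, text):
        # The OoO core needs the real rsp value, so a sync uop comes first.
        self._sync()
        self.uops.append(text)


se = StackEngine()
se.push("rbx")
se.push("rbp")
se.explicit_rsp_use("sub rsp, 32")
for uop in se.uops:
    print(uop)
```

In this model, two pushes issue only two store uops, and the single sync uop appears only when `sub rsp, 32` needs the real value of rsp.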

Note that Agner's instruction tables show that Pentium-M and later decode pop reg to a single uop which runs only on the load port. But Pentium II/III decode pop eax to 2 uops (1 ALU and 1 load), because there's no stack engine to handle the ESP adjustment outside of the out-of-order core. Besides taking extra uops, a long chain of push/pop and call/ret creates a serial dependency on ESP, so out-of-order execution has to chew through the ALU uops before a value is available for a mov ebp, esp, or an address for mov eax, [esp+16].
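As a rough back-of-the-envelope comparison, the cost of a run of pops followed by an instruction that reads ESP can be counted as below. The function name `pop_chain_cost` is hypothetical, and the model makes simplifying assumptions: it ignores offset-counter overflow and assumes exactly one sync uop is needed before ESP is read.

```python
def pop_chain_cost(n_pops, has_stack_engine):
    """Return (uops issued, depth of the dependency chain on ESP).

    Simplified model: without a stack engine each pop is 1 load + 1 ALU uop,
    and the ALU uops form a serial chain on ESP. With a stack engine each pop
    is 1 load uop, plus a single sync uop before ESP is read directly.
    """
    if n_pops == 0:
        return 0, 0
    if has_stack_engine:
        return n_pops + 1, 1
    return 2 * n_pops, n_pops


for n in (1, 4, 8):
    print(n, pop_chain_cost(n, False), pop_chain_cost(n, True))
```

The point of the comparison is the second number: without a stack engine the chain on ESP grows with every pop, while with one it stays constant in this model.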


  2. The P6 microarch family (PPro to Nehalem) stored the input values for a uop directly in the ROB. At issue/rename, "cold" register inputs are read from the architectural register file into the ROB (which can be a bottleneck, due to limited read ports. See register-read stalls). After executing a uop, the result is written into the ROB for other uops to read. The architectural register file is updated with values from the ROB when uops retire.

SnB-family microarchitectures (and P4) have a physical register file, so the ROB stores register numbers (i.e. a level of indirection) instead of the data directly. Re-Order Buffer is still an excellent name for that part of the CPU.

Note that SnB introduced AVX, with 256b vectors. Making every ROB entry big enough to store double-size vectors was presumably undesirable compared to only keeping them in a smaller FP register file.
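The difference between the two ROB styles can be sketched with two small data structures. This is a conceptual illustration only: the names `ValueRobEntry`, `IndexRobEntry`, `retire_p6` and `retire_prf` are hypothetical, and real hardware tracks much more state per entry.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ValueRobEntry:
    """P6-style entry: the result data is stored in the ROB entry itself."""
    dest: str
    value: Optional[int] = None


@dataclass
class IndexRobEntry:
    """PRF-style entry: only a physical register number is stored."""
    dest: str
    phys_reg: int


def retire_p6(entry: ValueRobEntry, arch_regs: Dict[str, int]) -> None:
    # At retirement the data is copied from the ROB into the architectural file.
    arch_regs[entry.dest] = entry.value


def retire_prf(entry: IndexRobEntry, retired_map: Dict[str, int]) -> None:
    # At retirement only the mapping changes; the data stays in the PRF.
    retired_map[entry.dest] = entry.phys_reg


# P6 style: execute writes the value into the ROB entry, retire copies it out.
arch_regs = {"eax": 0}
p6_entry = ValueRobEntry(dest="eax")
p6_entry.value = 42
retire_p6(p6_entry, arch_regs)

# PRF style: execute writes the physical register file, retire updates a map.
prf: List[int] = [0] * 8
retired_map = {"eax": 0}
prf_entry = IndexRobEntry(dest="eax", phys_reg=5)
prf[prf_entry.phys_reg] = 42
retire_prf(prf_entry, retired_map)

print(arch_regs["eax"], prf[retired_map["eax"]])
```

Because `IndexRobEntry` holds only a small index, its size does not depend on how wide the data is, which is the property the AVX remark above relies on.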

SnB simplified the uop format to save power. This did sacrifice some uop micro-fusion capability, though: the decoders and uop cache can still micro-fuse memory operands using 2-register (indexed) addressing modes, but they're "un-laminated" before issuing into the OoO core.
