What exactly happens when a Skylake CPU mispredicts a branch?

Problem description

I'm trying to understand in detail what happens to instructions in the various stages of the Skylake CPU pipeline when a branch is mispredicted, and how quickly instructions from the correct branch destination can start executing.

So let's label the two code paths here as red (the one predicted, but not actually taken) and green (the one taken, but not predicted). The questions are:

1. How far through the pipeline does the branch have to get before red instructions start being discarded (and at what stage(s) of the pipeline are they discarded)?
2. How soon (in terms of the pipeline stage reached by the branch) can green instructions start being executed?

I've looked at Agner Fog's documents and numerous sets of lecture notes, but found no clarity on these points.

Solution

The branch execution units (on ports 0 and 6) are what actually check FLAGS for conditional branches, or the target address for indirect branches. I think recovery begins as soon as an execution unit discovers the misprediction, without waiting for the branch to reach retirement. (Some of this is my best guess / understanding, not necessarily backed up by Intel's optimization manual.)

Branch prediction + speculative execution decouples data dependencies from control dependencies, but the branch uop itself does have a data dependency on EFLAGS or an indirect address input.

The branch unit on p0 can only run predicted-not-taken JCC uops (or macro-fused JCC uops), but those are common. The branch unit on p6 is the "main" one which handles taken branches.


For direct branches (jmp rel8/rel32 / call rel32), the prediction can be checked at decode, which can then re-steer the fetch stages. That may stall the front-end, but I think it never needs to trigger any kind of recovery in the back end: uops from the wrong path would never be issued for direct unconditional branches. There are perf counters for pipeline re-steers.


Branch mispredicts have fast recovery with a branch-order buffer, unlike the usual rollback to retirement state on exceptions (see "When an interrupt occurs, what happens to instructions in the pipeline?"). For more about how the pipeline treats everything as speculative until retirement, see "Out-of-order execution vs. speculative execution".

According to David Kanter's Sandybridge microarch writeup:

"Nehalem enhanced the recovery from branch mispredictions, which has been carried over into Sandy Bridge. Once a branch misprediction is discovered, the core is able to restart decoding as soon as the correct path is known, at the same time that the out-of-order machine is clearing out uops from the wrongly speculated path. Previously, the decoding would not resume until the pipeline was fully flushed."

This is the "fast recovery" enabled by a branch-order buffer that snapshots reg-renaming state on conditional and indirect branch instructions, which are expected to mispredict even in normal programs. But exceptions and memory-ordering machine clears are more expensive. They do happen (especially page faults), but they're rarer and harder to optimize for.
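
The snapshot-and-restore idea can be sketched with a toy register alias table. This is only an illustration of checkpointing rename state per branch, not a description of Intel's actual branch-order buffer: the class name `RenameTable`, its methods, and the assumption that branch ids increase in program order are all hypothetical, and freeing of physical registers is not modeled.

```python
class RenameTable:
    """Toy register alias table (RAT) with per-branch checkpoints (hypothetical model)."""

    def __init__(self, arch_regs):
        # Each architectural register starts mapped to its own physical register.
        self.mapping = {reg: i for i, reg in enumerate(arch_regs)}
        self.next_phys = len(arch_regs)
        self.checkpoints = {}  # branch id -> saved copy of the mapping

    def rename_dest(self, reg):
        """Allocate a fresh physical register for a uop that writes `reg`."""
        self.mapping[reg] = self.next_phys
        self.next_phys += 1
        return self.mapping[reg]

    def checkpoint(self, branch_id):
        """Snapshot the mapping when a conditional or indirect branch is renamed."""
        self.checkpoints[branch_id] = dict(self.mapping)

    def recover(self, branch_id):
        """On a mispredict, restore the snapshot and drop checkpoints of younger branches."""
        self.mapping = self.checkpoints.pop(branch_id)
        # Assumption: larger branch ids are younger in program order.
        self.checkpoints = {b: m for b, m in self.checkpoints.items() if b < branch_id}


rat = RenameTable(["rax", "rbx"])
rat.rename_dest("rax")        # older uop, before the branch
rat.checkpoint(branch_id=0)   # branch is renamed: snapshot taken
before = dict(rat.mapping)
rat.rename_dest("rax")        # wrong-path uops keep renaming registers
rat.rename_dest("rbx")
rat.recover(branch_id=0)      # branch executes and turns out to be mispredicted
print(rat.mapping == before)
```

The point of the sketch is that recovery only needs the snapshot taken at the branch, so it does not have to wait for the branch to retire.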

The key point of fast recovery is that uops from before the mispredicted branch which are already in the ROB + RS (scheduler) can keep executing while later uops are being discarded and the front-end is re-steered to the correct address. So if the inputs to a JCC uop are ready early enough, most of the branch-miss penalty can be hidden when there's a long dependency chain the CPU can chew on while recovering, e.g. the mispredict on exit from a loop with a decent-length loop-carried dep chain, or with any bottleneck other than total uop throughput or port 6. See "Avoid stalling pipeline by calculating conditional early".
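
As a back-of-the-envelope sketch of that overlap, the function below computes how many cycles the back end would sit idle in a very simplified model: older uops finish at `chain_done_cycle`, the branch resolves at `branch_resolve_cycle`, and correct-path uops reach the back end `refill_cycles` later. The function name, the parameters, and the value 16 used for the refill time are illustrative assumptions, not measured Skylake numbers.

```python
def visible_penalty(chain_done_cycle, branch_resolve_cycle, refill_cycles):
    """Idle back-end cycles after a mispredict in a toy overlap model.

    Assumes older uops keep executing during recovery (fast recovery) and
    ignores every other bottleneck.
    """
    correct_path_arrives = branch_resolve_cycle + refill_cycles
    return max(0, correct_path_arrives - chain_done_cycle)


REFILL = 16  # placeholder front-end refill time in cycles (assumption)

# Branch inputs ready early: recovery overlaps fully with the long dep chain.
early = visible_penalty(chain_done_cycle=100, branch_resolve_cycle=10, refill_cycles=REFILL)

# Branch resolves late, near the end of the chain: most of the refill time is exposed.
late = visible_penalty(chain_done_cycle=100, branch_resolve_cycle=95, refill_cycles=REFILL)

print(early, late)
```

In this model the penalty disappears entirely once the branch resolves at least `refill_cycles` before the older work runs out, which is the case the paragraph above describes.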

Without fast recovery, I think all uops in the ROB would be discarded (i.e. all not-retired uops). There might be some middle ground here, like keeping already-executed uops from before the branch that were in the ROB but had left the scheduler. I don't know what Merom/Conroe did exactly.
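
The difference between the two policies can be written down as a toy function over a list of not-yet-retired uops, oldest first. It follows the guess above (a full flush discards every un-retired uop) and is not a claim about what any real core does; `squash` and its arguments are hypothetical names.

```python
def squash(rob, branch_index, fast_recovery):
    """Return (kept, discarded) un-retired uops after a mispredict.

    rob: list of uop labels, oldest first.
    branch_index: position of the mispredicted branch in `rob`.
    """
    if fast_recovery:
        # Keep the branch and everything older; drop only younger, wrong-path uops.
        return rob[:branch_index + 1], rob[branch_index + 1:]
    # Assumed full-flush behavior: roll back to retirement state, dropping everything.
    return [], list(rob)


rob = ["load", "add", "jcc", "wrong1", "wrong2"]
kept_fast, dropped_fast = squash(rob, branch_index=2, fast_recovery=True)
kept_full, dropped_full = squash(rob, branch_index=2, fast_recovery=False)
print(kept_fast, dropped_fast)
print(kept_full, dropped_full)
```

The "middle ground" mentioned above would sit between these two return values, keeping some of the older uops that had already executed.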


Related: "Characterizing the Branch Misprediction Penalty" is an interesting paper about how branch misses and long cache misses interact with the ROB. It's based on a simplified pipeline model, but it looks to me like its findings probably apply to Skylake.
