What exactly happens when a Skylake CPU mispredicts a branch?


Question

I'm trying to understand in detail what happens to instructions in the various stages of the Skylake CPU pipeline when a branch is mispredicted, and how quickly instructions from the correct branch destination can start executing.

So let's label the two code paths here as red (the one predicted, but not actually taken) and green (the one taken, but not predicted). The questions are:
1. How far through the pipeline does the branch have to get before red instructions start being discarded (and at what stage(s) of the pipeline are they discarded)?
2. How soon (in terms of the pipeline stage reached by the branch) can green instructions start being executed?

I've looked at Agner Fog's documents and numerous sets of lecture notes, but found no clarity on these points.

Recommended answer

The branch execution units (on ports 0 and 6) are what actually check the FLAGS or indirect-branch address for conditional or indirect branches. I think that recovery begins as soon as an execution unit discovers the mispredict, without waiting for the branch to reach retirement. (Some of this is my best guess / understanding, not necessarily backed up by Intel's optimization manual.)

Branch prediction + speculative execution decouples data dependencies from control dependencies, but the branch uop itself does have a data dependency on EFLAGS or an indirect address input.

The branch unit on p0 can only run predicted-not-taken JCC uops (or macro-fused JCC uops), but those are common. The branch unit on p6 is the "main" one, which handles taken branches.

For direct branches (jmp rel8/rel32 / call rel32), the prediction can be checked at decode and the fetch stages re-steered, maybe stalling the front-end, but I think never needing to trigger any kind of recovery in the back end. Uops from the wrong path would never be issued for direct unconditional branches. There are perf counters for pipeline re-steers.

Branch mispredicts have fast recovery with a branch-order buffer, unlike the usual rollback to retirement state on exceptions (see "When an interrupt occurs, what happens to instructions in the pipeline?"). For more about how the pipeline treats everything as speculative until retirement, see "Out-of-order execution vs. speculative execution".

From David Kanter's Sandybridge microarchitecture write-up (the quoted passage links "Nehalem" to https://www.realworldtech.com/nehalem/4/):

Nehalem enhanced the recovery from branch mispredictions, which has been carried over into Sandy Bridge. Once a branch misprediction is discovered, the core is able to restart decoding as soon as the correct path is known, at the same time that the out-of-order machine is clearing out uops from the wrongly speculated path. Previously, the decoding would not resume until the pipeline was fully flushed.

This is the "fast recovery" enabled by a branch-order buffer that snapshots register-renaming state on conditional and indirect branch instructions, which are expected to mispredict even in normal programs. But exceptions and memory-ordering machine clears are more expensive. They do happen (especially page faults), but they're rarer and harder to optimize for.
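The snapshot-and-restore idea can be sketched with a toy model. This is only an illustration of checkpointing a rename table at each branch, not a description of how Skylake actually implements its branch-order buffer: the class `ToyCore`, its fields, and the register names are all hypothetical, and details such as freeing physical registers are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCore:
    """Toy rename stage with per-branch checkpoints (hypothetical, not real hardware)."""
    # Register alias table: architectural register -> physical register number.
    rat: dict = field(default_factory=lambda: {"rax": 0, "rbx": 1})
    next_phys: int = 2
    rob: list = field(default_factory=list)          # uops in program order
    checkpoints: dict = field(default_factory=dict)  # branch ROB index -> RAT snapshot

    def rename(self, dest):
        """Allocate a fresh physical register for a uop that writes `dest`."""
        self.rat[dest] = self.next_phys
        self.next_phys += 1
        self.rob.append(("alu", dest, self.rat[dest]))

    def branch(self):
        """Issue a branch uop and snapshot the RAT as it stands at this point."""
        idx = len(self.rob)
        self.rob.append(("branch", None, None))
        self.checkpoints[idx] = dict(self.rat)
        return idx

    def recover(self, branch_idx):
        """On a mispredict: restore the snapshot and drop only the younger uops."""
        self.rat = dict(self.checkpoints[branch_idx])
        del self.rob[branch_idx + 1:]
        # Checkpoints taken on the wrong path are no longer meaningful.
        self.checkpoints = {i: s for i, s in self.checkpoints.items() if i <= branch_idx}


core = ToyCore()
core.rename("rax")   # older uop, before the branch: rax -> p2
b = core.branch()    # snapshot taken here
core.rename("rax")   # wrong-path uop: rax -> p3
core.rename("rbx")   # wrong-path uop: rbx -> p4
core.recover(b)      # rename state rolls back to the snapshot; the older uop stays in the ROB
```

The point of the sketch is that recovery only needs the snapshot taken at the branch, so it does not have to wait for the branch to retire, and uops older than the branch are left untouched.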

The key point of fast recovery is that uops from before the mispredicted branch which are already in the ROB + RS (scheduler) can keep executing while later uops are being discarded and the front-end is re-steered to the correct address. So if the inputs to a JCC uop are ready early enough, most of the branch-miss penalty can be hidden if there's a long dependency chain the CPU can be chewing on while recovering. For example, the mispredict on exit from a loop with a decent-length loop-carried dependency chain, or any bottleneck other than total uop throughput or a port 6 bottleneck. See "Avoid stalling pipeline by calculating conditional early".
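A back-of-the-envelope model of that overlap, as a rough sketch only: it assumes the correct-path work has to wait for the older dependency chain anyway, and it treats front-end refill as a fixed number of cycles. The function name `visible_penalty` and all cycle counts below are hypothetical illustration values, not measured Skylake numbers.

```python
def visible_penalty(branch_resolve_cycle, refill_cycles, chain_done_cycle):
    """Cycles of mispredict cost NOT hidden behind an older dependency chain.

    branch_resolve_cycle: cycle at which the branch uop executes and the
                          mispredict is detected (hypothetical input).
    refill_cycles:        assumed fixed front-end re-steer/refill latency.
    chain_done_cycle:     cycle at which the older dependency chain, which
                          keeps executing during recovery, finishes.
    """
    correct_path_ready = branch_resolve_cycle + refill_cycles
    # If the correct-path uops arrive before the old chain finishes,
    # the back end never ran out of work, so nothing is exposed.
    return max(0, correct_path_ready - chain_done_cycle)


# Branch inputs ready early: recovery finishes long before the chain does.
early = visible_penalty(branch_resolve_cycle=10, refill_cycles=16, chain_done_cycle=100)

# Branch resolves late (e.g. it depends on the end of the chain): most of the refill is exposed.
late = visible_penalty(branch_resolve_cycle=95, refill_cycles=16, chain_done_cycle=100)
```

Under these assumptions, `early` comes out as 0 and `late` as 11, which is the sense in which computing the branch condition early can hide most of the penalty.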

Without fast recovery, I think all uops in the ROB would be discarded (i.e. all not-retired uops). There might be some middle ground here, like keeping already-executed uops from before the branch that were in the ROB but had left the scheduler. I don't know what Merom/Conroe did exactly.

Related: "Characterizing the Branch Misprediction Penalty" is an interesting paper about how branch misses and long cache misses interact with the ROB. It's based on a simplified pipeline model, but it looks to me like its findings probably apply to Skylake.
