MIPS 如何在不停止的情况下处理前一条 ALU 指令的分支? [英] How does MIPS I handle branching on the previous ALU instruction without stalling?

查看:36
本文介绍了MIPS 如何在不停止的情况下处理前一条 ALU 指令的分支?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

 addiu $6,$6,5bltz 6 美元,L5 美元没有...$L5:

这如何安全而不会停顿,这是经典 MIPS 甚至无法做到的,除非缓存未命中?(MIPS 最初代表没有互锁流水线级的微处理器,并且有一个加载延迟槽而不是互锁.)

原始 MIPS I 是经典的 5 级 RISC IF ID EX MEM WB 设计,它通过 一个单一的分支延迟槽,通过在 ID 阶段及早检查分支条件(更正:这是错误,去阅读这个答案;不要被基于这个错误前提的问题中的其他细节所误导).这就是为什么它仅限于等于/不等于或符号位检查,例如 lt 或 ge 零,而不是两个需要通过加法器进行进位传播的寄存器之间的 lt.

这不是说分支需要比 ALU 指令早一个周期准备好输入吗? bltz 在与 addiu 相同的周期中进入 ID 阶段 进入 EX.

MIPS I(又名 R2000)使用来自 EX 输出的绕过转发到 EX 输入,因此普通整数 ALU 指令(如addu/xor 链)具有单周期延迟并且可以在连续周期中运行.


MIPS 代表没有互锁流水线阶段的微处理器,所以它不会检测到 RAW 危害;代码必须避免它们.(因此,第一代 MIPS 上的加载延迟槽,在这种情况下,MIPS II 添加了互锁来停止,使首字母缩略词 :P 无效).

但我从来没有看到任何关于提前多条指令计算分支条件以避免停顿的讨论.(addiu/bltz 示例由 MIPS gcc5.4 -O3 -march=mips1 在 Godbolt 上,确实尊重加载延迟槽,如果需要,用 nop 填充.)


它是否使用某种技巧,例如在时钟的下降沿读取 EX 读取输入,并且 ID 在上升沿之前不需要转发寄存器值?(随着 EX 足够早地产生结果以使其起作用)

我想如果时钟速度的上限足够低以使缓存访问是单周期的,那将是有意义的.

MIPS 中的停顿或冒泡 声称 lw+ 加载结果上的 beq 需要 2 个停顿周期,因为它无法转发.这对于实际的 MIPS I 来说是不准确的(除非 gcc 有问题).不过,它确实提到了半个时钟周期,允许在同一整个周期内写入一个值然后从寄存器文件中读取.

解决方案

TL:DR: Classic MIPS I在EX的前半个周期检查分支条件,所以转发toem> 他们并不特别.

IF 只需要第二个周期的地址,EX 就可以转发.

这些因素结合起来只给出了 1 个分支延迟周期(被 1 个延迟槽隐藏),对于依赖于前一个 ALU 指令的分支没有问题.

<小时>

在 MIPS I (R2000) 上运行 sltu/beq 绝对安全.这被列为 bgeu 伪指令的扩展,例如,在真正的 MIPS 手册和书籍中,没有任何关于它在 MIPS R2000 或任何其他 MIPS 上不安全的警告.

GCC 在实践中使用类似的序列,即使 march=mips1 尊重真实 MIPS R2000 的加载延迟槽和其他特性.

<小时>

MIPS 的 IF 直到时钟周期的第二个半周期才需要地址,允许 EX 足够快地生成它.

来自 参见 MIPS Run 作者:Dominic Sweetman,(涵盖 MIPS I 到 MIPS IV),章节 1.5.1 指令约束

<块引用>

我们稍后会看到,高效的条件分支意味着关于是否分支的决定只需要压缩到一半流水线阶段;该架构通过保持分支决策测试非常简单来提供帮助.所以条件分支(在 MIPS 中)测试单个注册符号/零或一对相等的寄存器.

他们的图 1.3:流水线和分支延迟显示了在 EX 前半部分计算的分支条件,并在 IF 的后半部分使用,总分支延迟仅为 1 个周期/流水线阶段 (ID)/指令.IF 直到时钟周期的第二个半周期才真正开始.(并继续进入 ID.ID 的实际解码/寄存器获取只需要一个时钟周期的最后一部分.)>

这与我在问题中建议的最终结果相同(检查 ID 末尾的分支条件),除了它只需要 EX -> EX 转发以根据前一个 ALU 指令的结果进行分支.

也许我记错或误解了我之前读到的有关半周期分支决策的内容.这个半周期的事情很可能正是我记得看到的.

进一步引用参见 MIPS Run 1.5.5 程序员可见的流水线效果

<块引用>

• 延迟分支:[第一段解释了分支延迟槽]

如果硬件没有做任何特别的事情,决定分支或不,与分支目标地址一起,会出现在最后ALU 管道阶段——及时获取分支目标指令而不是下一条指令,而是两条.但是分支很重要足以证明特殊处理是合理的,您可以从图 1.3 中看到 [如上所述]通过 ALU 提供了一条特殊的路径,使分支地址提前半个时钟周期可用.加上取指令阶段的奇数半时钟周期移位,这意味着可以及时取到分支目标成为下一个,所以硬件运行分支指令,然后是分支延迟槽指令,以及然后是分支目标——没有其他延迟.

... [不要浪费你的分支延迟槽]

... [许多 MIPS 汇编程序会在安全的情况下为您重新排序指令,以隐藏分支延迟]

See MIPS Run 由 John L. Hennessy 撰写的前言,MIPS Technologies 等创始人等.这并不能证明他在书中签署了所有内容都是准确的,但这是很好的证据,表明该书对 MIPS 如何管理这一技巧的描述是准确的.

它很容易理解并且 100% 可信;我们已经知道数据缓存具有单周期获取延迟(在 EX 阶段的地址生成之后).

        addiu   $6,$6,5
        bltz    $6,$L5
        nop
        ...
$L5:

How is this safe without stalling, which classic MIPS couldn't even do, except on cache miss? (MIPS originally stood for Microprocessor Without Interlocked Pipeline Stages, and had a load delay slot instead of interlocking.)

Original MIPS I is a classic 5-stage RISC IF ID EX MEM WB design that hides all of its branch latency with a single branch-delay slot by checking branch conditions early, in the ID stage (correction: this was the mistake, go read this answer; don't be misled by the rest of the details in the question based on this false premise). Which is why it's limited to equal/not-equal, or sign-bit checks like lt or ge zero, not lt between two registers that would need carry-propagation through an adder.

Doesn't this mean that branches need their input ready a cycle earlier than ALU instructions? The bltz enters the ID stage in the same cycle that addiu enters EX.

MIPS I (aka R2000) uses bypass forwarding from EX-output to EX-input so normal integer ALU instructions (like a chain of addu/xor) have single-cycle latency and can run in consecutive cycles.


MIPS stands for "Microprocessor without Interlocked Pipeline Stages", so it doesn't detect RAW hazards; code has to avoid them. (Hence load-delay slots on first-gen MIPS, with MIPS II adding interlocks to stall in that case, invalidating the acronym :P).

But I never see any discussion of calculating the branch condition multiple instructions ahead to avoid a stall. (The addiu/bltz example was emitted by MIPS gcc5.4 -O3 -march=mips1 on Godbolt, which does respect load-delay slots, filling with nop if needed.)


Does it use some kind of trick like EX reading inputs on the falling edge of the clock, and ID not needing forwarded register values until the rising edge? (With EX producing its results early enough for that to work)

I guess that would make sense if the clock speed is capped low enough for cache access to be single-cycle.

Stalling or bubble in MIPS claims that lw + a beq on the load result needs 2 stall cycles because it can't forward. That's not accurate for actual MIPS I (unless gcc is buggy). It does mention half clock cycles, though, allowing a value to be written and then read from the register file in the same whole cycle.

解决方案

TL:DR: Classic MIPS I checks branch conditions in the first half cycle of EX, so forwarding to them is not special.

IF only needs the address in the 2nd half of a cycle so EX can forward to it.

These factors combine to give only 1 cycle of branch latency (hidden by 1 delay slot), with no problem for branches that depend the previous ALU instruction.


It was definitely safe to run sltu / beq on MIPS I (R2000). That's listed as the expansion for the bgeu pseudo-instruction, for example, in real MIPS manuals and books with no caveat about it being unsafe on MIPS R2000 or any other MIPS.

GCC uses sequences like that in practice even with march=mips1 which respects load-delay slots and other features of real MIPS R2000.


MIPS's IF doesn't need an address until the 2nd half of a clock cycle, allowing EX to produce it quickly enough.

From See MIPS Run by Dominic Sweetman, (covering MIPS I through MIPS IV), Chapter 1.5.1 Constraints on Instructions

We’ll see later that efficient conditional branching means that the decision about whether to branch or not has to be squeezed into only half a pipeline stage; the architecture helps out by keeping the branch decision tests very simple. So conditional branches (in MIPS) test a single register for sign/zero or a pair of registers for equality.

Their Figure 1.3: The pipeline and branch delays shows the branch condition being calculated in the first half of EX, and used in the 2nd half of IF, for a total branch latency of only 1 cycle / pipeline stage (ID) / instruction. IF doesn't actually start until the 2nd half of a clock cycle. (And continues into ID. The actual decode/register-fetch of ID only takes the last fraction of a clock cycle.)

That has the same end result as what I suggested in the question (check branch condition by the end of ID), except it only requires EX -> EX forwarding to branch on the result of the previous ALU instruction.

Perhaps I was misremembering or misinterpreting something I'd read previously about the half-cycle branch-decision. This half-cycle thing might well be exactly what I remembered seeing.

Further quoting See MIPS Run 1.5.5 Programmer-Visible Pipeline Effects

• Delayed branches: [first paragraph explains the branch-delay slot]

If nothing special was done by the hardware, the decision to branch or not, together with the branch target address, would emerge at the end of the ALU pipestage — in time to fetch the branch target instruction instead of the next instruction but two. But branches are important enough to justify special treatment, and you can see from Figure 1.3 [described above] that a special path is provided through the ALU to make the branch address available half a clock cycle early. Together with the odd half-clock-cycle shift of the instruction fetch stage, that means that the branch target can be fetched in time to become the next but one, so the hardware runs the branch instruction, then the branch delay slot instruction, and then the branch target — with no other delays.

... [don't waste your branch-delay slots]

... [many MIPS assemblers will reorder instructions for you if it's safe, to hide branch delay]

See MIPS Run has a foreword by John L. Hennessy, Founder of MIPS Technologies etc. etc.. That's not proof he signed off on everything in the book being accurate, but it's good evidence that the book's description of how MIPS managed this trick is accurate.

It's easily understandable and 100% plausible; we already know the data cache has single-cycle fetch latency (after address-generation in the EX stage).

这篇关于MIPS 如何在不停止的情况下处理前一条 ALU 指令的分支?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆