How does MIPS I forward from EX to ID for branches without stalling?

Problem description

        addiu   $6,$6,5
        bltz    $6,$L5
        nop
        ...
$L5:

Is that safe on MIPS I? If so, how?

Original MIPS I is a classic 5-stage RISC IF ID EX MEM WB design that hides all of its branch latency with a single branch-delay slot by checking branch conditions early, in the ID stage. (Which is why it's limited to equal/not-equal, or sign-bit checks like lt or ge zero, not lt between two registers that would need carry-propagation through an adder.)

Doesn't this mean that branches need their input ready a cycle earlier than ALU instructions? The bltz enters the ID stage in the same cycle that addiu enters EX.

MIPS I (aka R2000) uses bypass forwarding from EX-output to EX-input so normal integer ALU instructions (like a chain of addu/xor) have single-cycle latency and can run in consecutive cycles.
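To make the timing worry concrete, here is a minimal Python sketch using whole-cycle stage timing only. It is not taken from any MIPS manual: the function names, and the simplification that an ALU result becomes usable at the start of the cycle after its EX stage, are assumptions for illustration.

```python
# Whole-cycle model of the classic 5-stage pipeline (illustrative only).
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_cycle(issue_cycle: int, stage: str) -> int:
    """Cycle in which an instruction fetched at issue_cycle occupies stage."""
    return issue_cycle + STAGES.index(stage)


def result_ready(producer_issue: int) -> int:
    """First cycle in which the producer's ALU result can be bypassed.

    Assumption: the result exists at the end of EX, so it is usable
    from the start of the following cycle.
    """
    return stage_cycle(producer_issue, "EX") + 1


def can_forward(producer_issue: int, consumer_issue: int, consumer_stage: str) -> bool:
    """True if the consumer can get the producer's result without stalling."""
    return result_ready(producer_issue) <= stage_cycle(consumer_issue, consumer_stage)


# addiu is fetched in cycle 0, bltz in cycle 1 (back-to-back instructions).
branch_in_id = can_forward(0, 1, "ID")  # branch resolved in ID
branch_in_ex = can_forward(0, 1, "EX")  # branch resolved at the start of EX
```

Under these assumptions a branch that needs its operand in ID cannot be fed by the immediately preceding ALU instruction, while one that needs it at the start of EX can use the ordinary EX-to-EX bypass.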


MIPS stands for "Microprocessor without Interlocked Pipeline Stages", so it doesn't detect RAW hazards; code has to avoid them. (Hence load-delay slots on first-gen MIPS, with MIPS II adding interlocks to stall in that case, invalidating the acronym :P).
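Since the hardware does not interlock, avoiding the hazard is a job for the compiler or assembler. A rough Python sketch of the kind of check that implies is below; the `Instr` tuple and `load_delay_violations` are hypothetical names made up for this example, not part of any real toolchain, and only simple loads are modelled.

```python
from typing import List, NamedTuple, Optional, Tuple

# Simple MIPS I loads that have a load-delay slot (lwl/lwr are left out here).
LOADS = {"lw", "lh", "lhu", "lb", "lbu"}


class Instr(NamedTuple):
    """Hypothetical, minimal instruction record for this sketch."""
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


def load_delay_violations(program: List[Instr]) -> List[int]:
    """Return indices of instructions that read a register loaded by the
    instruction immediately before them (unsafe without interlocks)."""
    bad = []
    for i in range(1, len(program)):
        prev, cur = program[i - 1], program[i]
        if prev.op in LOADS and prev.dest not in (None, "$0") and prev.dest in cur.srcs:
            bad.append(i)
    return bad


lw = Instr("lw", "$2", ("$4",))
use = Instr("addu", "$3", ("$2", "$5"))
nop = Instr("nop", None, ())

unsafe = load_delay_violations([lw, use])     # use sits in the load-delay slot
safe = load_delay_violations([lw, nop, use])  # nop fills the load-delay slot
```

A real scheduler would try to fill the slot with an independent instruction rather than a nop; this sketch only detects the problem.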

But I never see any discussion of calculating the branch condition multiple instructions ahead to avoid a stall. (The addiu/bltz example was emitted by MIPS gcc 5.4 with -O3 -march=mips1 on Godbolt, which does respect load-delay slots, filling them with nop if needed.)


Does it use some kind of trick like EX reading inputs on the falling edge of the clock, and ID not needing forwarded register values until the rising edge? (With EX producing its results early enough for that to work)

I guess that would make sense if the clock speed is capped low enough for cache access to be single-cycle.

The post "Stalling or bubble in MIPS" claims that an lw followed by a beq on the load result needs 2 stall cycles because it can't forward. That's not accurate for actual MIPS I (unless gcc is buggy). It does mention half clock cycles, though, which allow a value to be written to and then read from the register file within the same whole cycle.

Solution

TL;DR: Classic MIPS I checks branch conditions in the first half cycle of EX, so forwarding to them is not special.

IF only needs the address in the 2nd half of a cycle so EX can forward to it.

These factors combine to give only 1 cycle of branch latency (hidden by 1 delay slot), with no problem for branches that depend on the previous ALU instruction.


It was definitely safe to run sltu / beq on MIPS I (R2000). That's listed as the expansion for the bgeu pseudo-instruction, for example, in real MIPS manuals and books with no caveat about it being unsafe on MIPS R2000 or any other MIPS.
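For reference, the expansion in question is `sltu $at, rs, rt` followed by `beq $at, $zero, label`. The small Python model below (function names are mine, purely illustrative) checks that this pair computes an unsigned greater-or-equal on 32-bit register values; the beq consumes the sltu result directly, which is exactly the back-to-back ALU-then-branch dependency being discussed.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide on MIPS I


def sltu(rs: int, rt: int) -> int:
    """sltu $at, rs, rt: 1 if rs < rt as unsigned 32-bit values, else 0."""
    return 1 if (rs & MASK32) < (rt & MASK32) else 0


def bgeu_taken(rs: int, rt: int) -> bool:
    """Model of the bgeu pseudo-instruction expansion.

    sltu $at, rs, rt      # $at = (rs < rt), unsigned
    beq  $at, $zero, L    # taken when $at == 0, i.e. rs >= rt
    """
    at = sltu(rs, rt)
    return at == 0


equal_case = bgeu_taken(5, 5)
big_vs_small = bgeu_taken(0xFFFFFFFF, 1)
small_vs_big = bgeu_taken(1, 0xFFFFFFFF)
```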

GCC uses sequences like that in practice even with -march=mips1, which respects load-delay slots and other features of real MIPS R2000.


MIPS's IF doesn't need an address until the 2nd half of a clock cycle, allowing EX to produce it quickly enough.

From See MIPS Run by Dominic Sweetman (covering MIPS I through MIPS IV), Section 1.5.1, Constraints on Instructions:

We’ll see later that efficient conditional branching means that the decision about whether to branch or not has to be squeezed into only half a pipeline stage; the architecture helps out by keeping the branch decision tests very simple. So conditional branches (in MIPS) test a single register for sign/zero or a pair of registers for equality.

Their Figure 1.3: The pipeline and branch delays shows the branch condition being calculated in the first half of EX, and used in the 2nd half of IF, for a total branch latency of only 1 cycle / pipeline stage (ID) / instruction. IF doesn't actually start until the 2nd half of a clock cycle. (And continues into ID. The actual decode/register-fetch of ID only takes the last fraction of a clock cycle.)

That has the same end result as what I suggested in the question (check branch condition by the end of ID), except it only requires EX -> EX forwarding to branch on the result of the previous ALU instruction.

Perhaps I was misremembering or misinterpreting something I'd read previously about the half-cycle branch-decision. This half-cycle thing might well be exactly what I remembered seeing.

Further quoting See MIPS Run, Section 1.5.5, Programmer-Visible Pipeline Effects:

• Delayed branches: [first paragraph explains the branch-delay slot]

If nothing special was done by the hardware, the decision to branch or not, together with the branch target address, would emerge at the end of the ALU pipestage — in time to fetch the branch target instruction instead of the next instruction but two. But branches are important enough to justify special treatment, and you can see from Figure 1.3 [described above] that a special path is provided through the ALU to make the branch address available half a clock cycle early. Together with the odd half-clock-cycle shift of the instruction fetch stage, that means that the branch target can be fetched in time to become the next but one, so the hardware runs the branch instruction, then the branch delay slot instruction, and then the branch target — with no other delays.

... [don't waste your branch-delay slots]

... [many MIPS assemblers will reorder instructions for you if it's safe, to hide branch delay]
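The arithmetic in that passage can be sketched in half-cycle units. This is my own toy model of the book's description, not something from the book: the function names are invented, and it assumes IF needs its address at the start of the second half of its cycle, and that the branch address is ready either after the first half of EX (special path) or at the end of EX (no special path).

```python
def fetch_address_deadline(slot: int) -> int:
    """Half-cycle index at which the IF for instruction slot `slot`
    first needs its address (start of the 2nd half of cycle `slot`)."""
    return 2 * slot + 1


def branch_address_ready(branch_slot: int, early_path: bool) -> int:
    """Half-cycle index at which the branch decision and target are known."""
    ex_start = 2 * (branch_slot + 2)  # EX is two stages after IF
    # Special path: ready after the first half of EX; otherwise at the end of EX.
    return ex_start + (1 if early_path else 2)


def target_slot(branch_slot: int, early_path: bool) -> int:
    """First instruction slot whose fetch can use the branch target address."""
    ready = branch_address_ready(branch_slot, early_path)
    slot = branch_slot + 1
    while fetch_address_deadline(slot) < ready:
        slot += 1
    return slot


def delay_slots(early_path: bool) -> int:
    """Instructions fetched between the branch and its target."""
    return target_slot(0, early_path) - 1


with_special_path = delay_slots(early_path=True)
without_special_path = delay_slots(early_path=False)
```

In this model the half-cycle-early path makes the target the "next but one" instruction (one delay slot), and without it the target would be the "next but two", matching the quoted description.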

See MIPS Run has a foreword by John L. Hennessy, founder of MIPS Technologies, etc. That's not proof he signed off on everything in the book being accurate, but it's good evidence that the book's description of how MIPS managed this trick is accurate.

It's easily understandable and 100% plausible; we already know the data cache has single-cycle fetch latency (after address-generation in the EX stage).
