How many instructions need to be killed on a mispredict in a 6-stage scalar or superscalar MIPS?


Question

I am working on a pipeline with 6 stages: F D I X0 X1 W. I was asked how many instructions need to be killed when a branch mispredict happens.

I came up with 4. I think so because branch resolution happens in X1, and we need to kill all the instructions that came after the branch. In the pipeline diagram, it looks like this requires killing the 4 instructions that are flowing through the pipeline behind the branch. Is that correct?

I was also asked how many need to be killed if the pipeline is a three-wide superscalar. I am not sure about this one. I think it would be 12, because you can fetch 3 instructions at a time. Is that correct?

Recommended answer

"kill all the instructions that came after the branch"

Not if this is a real MIPS. MIPS has one branch-delay slot: the instruction after a branch always executes, whether the branch is taken or not. (jal's return address is the end of the delay slot, so the delay-slot instruction doesn't execute twice.)
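To make the delay-slot behaviour concrete, here is a minimal sketch of a toy instruction stream in Python. The `run` function and the tuple-based program format are hypothetical, invented for illustration; it models only always-taken branches and assumes the delay-slot instruction is not itself a branch.

```python
def run(program, delay_slot=True, max_steps=100):
    """Return the labels of the instructions that execute, in order.

    Each instruction is (label, target). target is None for a plain
    instruction, or the index to jump to for an always-taken branch.
    This is a toy model, not a MIPS simulator.
    """
    executed = []
    pc = 0
    while pc < len(program) and len(executed) < max_steps:
        label, target = program[pc]
        executed.append(label)
        if target is None:
            pc += 1
            continue
        if delay_slot and pc + 1 < len(program):
            # The instruction right after the branch runs before
            # control transfers to the branch target.
            executed.append(program[pc + 1][0])
        pc = target
    return executed


program = [
    ("add", None),      # index 0
    ("branch", 4),      # index 1: taken branch to index 4
    ("slot", None),     # index 2: branch-delay slot
    ("skipped", None),  # index 3: never reached
    ("target", None),   # index 4
]

print(run(program, delay_slot=True))   # ['add', 'branch', 'slot', 'target']
print(run(program, delay_slot=False))  # ['add', 'branch', 'target']
```

With the delay slot enabled, `slot` executes even though the branch is taken, which is why that instruction never has to be killed on a mispredict.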

This was enough to fully hide the 1 cycle of branch latency on classic MIPS I (R2000), which used a scalar classic RISC 5-stage pipeline. It achieved that 1-cycle branch latency by forwarding from the first half of an EX clock cycle to an IF starting in the second half of a clock cycle. This is why MIPS branch conditions are all "simple" (they don't need carry propagation through the whole word): beq compares two registers for equality, but the signed 2's complement comparisons bgez / bltz take only one operand and compare it against an implicit 0, which only requires checking the sign bit.
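As a rough illustration of why these conditions are cheap, the sketch below models them on 32-bit register values in Python. The helper names (`to_u32`, `beq_taken`, `bltz_taken`, `bgez_taken`) are made up for this example; the point is only that equality is an XOR followed by a zero check and the sign tests look at a single bit, so no subtraction with a carry chain is involved.

```python
MASK32 = 0xFFFFFFFF


def to_u32(value):
    """Encode a Python int as a 32-bit two's complement register pattern."""
    return value & MASK32


def beq_taken(rs, rt):
    # Equality: bitwise XOR, then check that every result bit is zero.
    return ((rs ^ rt) & MASK32) == 0


def bltz_taken(rs):
    # "Less than zero" is just the sign bit (bit 31) of the register.
    return ((rs >> 31) & 1) == 1


def bgez_taken(rs):
    # "Greater than or equal to zero" is the inverted sign bit.
    return not bltz_taken(rs)


print(beq_taken(to_u32(-1), 0xFFFFFFFF))  # True
print(bltz_taken(to_u32(-5)))             # True
print(bgez_taken(to_u32(7)))              # True
```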

If your pipeline were well designed, you'd expect it to resolve branches after X0, because the MIPS ISA is already restricted so that low-latency branch decisions are easy for the ALU. But apparently your pipeline is not optimized and branch decisions aren't ready until the end of X1, which defeats the purpose of making it run MIPS code instead of RISC-V or any other RISC instruction set.

"I have come up with 4. I think this because the branch resolution happens in X1 and we will need to kill all the instructions that came after the branch."

I think 4 cycles looks right for a generic scalar pipeline with no branch-delay slot.

At the end of that X1 cycle, there's an instruction in each of the previous 4 pipeline stages (F, D, I, X0), waiting to move to the next stage on that clock edge (assuming no other pipeline bubbles). The delay-slot instruction is one of those and doesn't need to be killed.

(Unless there was an I-cache miss fetching the delay-slot instruction, in which case the delay-slot instruction might not even be in the pipeline yet. So it's not as simple as killing the 3 stages before X0, or even killing all but the oldest of the instructions behind the branch. Delay slots are not free to implement, and they also complicate exception handling.)

So 0..3 instructions need to be killed in pipeline stages F to I. (If it's possible for the delay-slot instruction to be in one of those stages, you have to detect that special case. If it isn't, e.g. because I-cache miss latency is long enough that the delay-slot instruction is either in X0 or still waiting to be fetched, then the pipeline can just kill those first 3 stages and do something based on whether X0 holds a bubble.)
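The counting argument for the scalar case can be sketched in Python. The function name `scalar_kills` and its parameters are hypothetical; the sketch assumes a full pipeline with no bubbles and, when a delay slot is modelled, that the delay-slot instruction is already in the pipeline, so the result is the upper end of the 0..3 (or 0..4) range.

```python
STAGES = ["F", "D", "I", "X0", "X1", "W"]


def scalar_kills(resolve_stage="X1", delay_slot=False, stages=STAGES):
    """Upper bound on instructions killed by a mispredict (scalar pipeline).

    Assumes one instruction in every stage before the resolving stage
    (no bubbles). With a delay slot, the oldest of those younger
    instructions must be kept, so it is not counted.
    """
    younger = stages.index(resolve_stage)  # stages ahead of the branch
    if delay_slot:
        return max(younger - 1, 0)
    return younger


print(scalar_kills())                     # 4: F, D, I, X0
print(scalar_kills(delay_slot=True))      # 3: F, D, I
print(scalar_kills(resolve_stage="X0"))   # 3 if branches resolved in X0
```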

"I think that it would be 12 because you can fetch 3 instructions at a time"

No. Remember that the branch itself is one of a group of 3 instructions that can go through the pipeline together. In the predict-not-taken case, presumably the decode stage would have sent all 3 instructions in that fetch/decode group down the pipe.

The worst case, I think, is when the branch is the first (oldest in program order) instruction in a group. Then 1 (or 2 with no branch-delay slot) instructions from that group in X1 have to be killed, as well as all instructions in previous stages. Assuming no bubbles, you're cancelling 13 (or 14) instructions: 3 in each of the 4 previous stages, plus the 1 (or 2) from the branch's own group.

The best case is when the branch is last (youngest in program order) in a group of 3. Then you're discarding 11 (or 12 with no delay slot).

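So for a 3-wide version of this pipeline with no delay slot, you're killing 0..14 instructions already in the pipeline, depending on bubbles in previous pipeline stages.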
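The worst-case and best-case numbers for the 3-wide pipeline can be reproduced with a short Python sketch. The function `superscalar_kills` and its parameters are hypothetical names for illustration; it assumes every slot in every earlier stage is full (no bubbles) and that the branch resolves in X1 with 4 stages (F, D, I, X0) behind it.

```python
def superscalar_kills(width=3, branch_pos=0, earlier_stages=4, delay_slot=False):
    """Upper bound on instructions killed by a mispredict.

    width          -- instructions per group (3 for a three-wide machine)
    branch_pos     -- position of the branch in its group, 0 = oldest
    earlier_stages -- pipeline stages before the resolving stage
    delay_slot     -- if True, the instruction right after the branch
                      is kept, so one fewer instruction is killed
    """
    same_group = width - 1 - branch_pos            # younger, same group
    younger = same_group + width * earlier_stages  # plus full earlier stages
    if delay_slot and younger > 0:
        younger -= 1
    return younger


print(superscalar_kills(branch_pos=0))                   # 14
print(superscalar_kills(branch_pos=0, delay_slot=True))  # 13
print(superscalar_kills(branch_pos=2))                   # 12
print(superscalar_kills(branch_pos=2, delay_slot=True))  # 11
```

Setting `width=1` reduces this to the scalar case, where the same function gives 4 without a delay slot.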

Implementing a delay slot sucks; there's a reason newer ISAs don't expose that pipeline detail. Long-term pain for short-term gain.
