With variable length instructions how does the computer know the length of the instruction being fetched?


Problem Description


In architectures where not all the instructions are the same length, how does the computer know how much to read for one instruction? For example in Intel IA-32 some instructions are 4 bytes, some are 8 bytes, so how does it know whether to read 4 or 8 bytes? Is it that the first instruction read when the machine is powered on has a known size and each instruction contains the size of the next one?

Solution

First, the processor does not need to know how many bytes to fetch; it can fetch a convenient number of bytes, enough to provide the targeted throughput for typical or average instruction lengths. Any extra bytes can be placed in a buffer to be used in the next group of bytes to be decoded. There are tradeoffs in the width and alignment of fetch relative to the supported width of instruction decode and even with respect to the width of later parts of the pipeline. Fetching more bytes than average can reduce the impact of instruction length variability and of the effective fetch bandwidth loss caused by taken control flow instructions.
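
As a rough illustration of that buffering (a toy model, not any particular microarchitecture; the chunk size, buffer size, and names are invented), the following C sketch fetches aligned 16-byte chunks and carries leftover bytes forward for the next decode group:

    /* Toy fetch model: grab aligned 16-byte chunks, keep leftover bytes
       for the next decode group. Sizes and names are illustrative only. */
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define FETCH_WIDTH 16   /* bytes fetched per cycle */
    #define BUF_CAP     32   /* small decode buffer */

    typedef struct {
        uint8_t bytes[BUF_CAP];
        size_t  len;         /* valid bytes carried over from earlier fetches */
    } fetch_buf;

    /* Append one fetch block starting at pc; the decoder later consumes whole
       instructions and leaves any trailing partial instruction in the buffer. */
    static void fetch_cycle(fetch_buf *fb, const uint8_t *mem, size_t pc)
    {
        size_t base   = pc & ~(size_t)(FETCH_WIDTH - 1); /* aligned fetch block */
        size_t useful = FETCH_WIDTH - (pc - base);       /* bytes at/after pc */
        if (fb->len + useful > BUF_CAP)
            useful = BUF_CAP - fb->len;                  /* buffer full: stall */
        memcpy(fb->bytes + fb->len, mem + pc, useful);
        fb->len += useful;
    }

    int main(void)
    {
        uint8_t mem[64] = {0};
        fetch_buf fb = { {0}, 0 };
        fetch_cycle(&fb, mem, 5);        /* unaligned start: 11 useful bytes */
        printf("buffered %zu bytes\n", fb.len);
        return 0;
    }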

(Taken control flow instructions may introduce a fetch bubble if the [predicted] target is not available until a cycle after the next fetch, and they reduce effective fetch bandwidth when the target is less aligned than the instruction fetch. E.g., if instruction fetch is 16-byte aligned—as is common for high performance x86—a taken branch that targets the 16th [last] byte in a chunk will result in effectively only one byte of code being fetched, as the other 15 bytes are discarded.)
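
The arithmetic behind that example is just the distance from the branch target to the end of the aligned fetch chunk; a small C illustration (the addresses are made up):

    /* Useful bytes obtained from a 16-byte-aligned fetch when branching to
       various targets (illustrative addresses). */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        const unsigned fetch_width = 16;
        uint64_t targets[] = { 0x1000, 0x1008, 0x100F }; /* 0x100F = last byte */
        for (int i = 0; i < 3; i++) {
            unsigned offset = (unsigned)(targets[i] & (fetch_width - 1));
            printf("target 0x%llx: %u of %u fetched bytes are useful\n",
                   (unsigned long long)targets[i], fetch_width - offset, fetch_width);
        }
        return 0;
    }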

Even for fixed length instructions, fetching multiple instructions per cycle introduces similar issues. Some implementations (e.g., MIPS R10000) would fetch as many instructions as could be decoded even if they are not aligned, as long as the group of instructions does not cross a cache line boundary. (I seem to recall that one RISC implementation had two banks of Icache tags to allow fetch to cross a cache block—but not page—boundary.) Other implementations (e.g., POWER4) would fetch aligned chunks of code even for a branch targeting the last instruction in such a chunk. (For POWER4, 32-byte chunks containing 8 instructions were used, but at most five instructions could pass decode per cycle. This excess fetch width could be exploited to save energy via cycles where no fetch is performed and to provide spare Icache cycles for cache block fills after a miss while having only one read/write port to the Icache.)
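
A schematic contrast of the two policies (the cache line size, block size, and decode width below are assumed for illustration, not taken from either machine):

    /* Unaligned fetch capped at a cache-line boundary (R10000-like) versus a
       strictly aligned 32-byte block fetch (POWER4-like). Parameters assumed. */
    #include <stdint.h>
    #include <stdio.h>

    #define INSN_BYTES  4
    #define LINE_BYTES  64   /* assumed cache line size */
    #define BLOCK_BYTES 32   /* aligned fetch block */
    #define FETCH_INSNS 4    /* decode width */

    static unsigned fetch_unaligned(uint64_t pc)
    {
        unsigned to_line_end = (unsigned)((LINE_BYTES - pc % LINE_BYTES) / INSN_BYTES);
        return to_line_end < FETCH_INSNS ? to_line_end : FETCH_INSNS;
    }

    static unsigned fetch_aligned_block(uint64_t pc)
    {
        /* Only instructions from pc to the end of its aligned block are kept. */
        return (unsigned)((BLOCK_BYTES - pc % BLOCK_BYTES) / INSN_BYTES);
    }

    int main(void)
    {
        uint64_t pc = 0x101C;  /* last slot of a 32-byte block, mid cache line */
        printf("unaligned-style fetch: %u instructions\n", fetch_unaligned(pc));
        printf("aligned-block fetch:   %u instruction(s)\n", fetch_aligned_block(pc));
        return 0;
    }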

For decoding multiple instructions per cycle, there are effectively two strategies: speculatively decode in parallel or wait for the length to be determined and use that information to parse the instruction stream into separate instructions. For an ISA like IBM's zArchitecture (S/360 descendant), the length in 16-bit parcels is trivially determined by two bits in the first parcel, so waiting for the lengths to be determined makes more sense. (RISC V's slightly more complex length indication mechanism would still be friendly to non-speculative decode.) For an encoding like that of microMIPS or Thumb2, which only have two lengths determinable by the major opcode and for which the encoding of different length instructions is substantially different, using non-speculative decode may be preferred, especially given the likely narrow decode and emphasis on energy-efficiency, though with only two lengths some speculation may be reasonable at small decode width.
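
For concreteness, a small C sketch of both length rules: on zArchitecture the top two bits of the first halfword give the instruction length, and on RISC V the low bits of the first 16-bit parcel do (only the 16- and 32-bit RISC V forms are handled here):

    /* Instruction length from the first 16-bit parcel, per the published
       encodings (zArchitecture/S/360 and RISC V). */
    #include <stdint.h>
    #include <stdio.h>

    /* zArchitecture: the two most significant bits of the opcode:
       00 -> one halfword, 01/10 -> two halfwords, 11 -> three halfwords. */
    static unsigned zarch_length_bytes(uint16_t parcel)
    {
        switch (parcel >> 14) {
        case 0:  return 2;
        case 3:  return 6;
        default: return 4;
        }
    }

    /* RISC V: low two bits != 11 -> 16-bit compressed; otherwise 32-bit unless
       bits [4:2] are all ones (longer formats, not handled in this sketch). */
    static unsigned riscv_length_bytes(uint16_t parcel)
    {
        if ((parcel & 0x3) != 0x3)
            return 2;
        if ((parcel & 0x1c) != 0x1c)
            return 4;
        return 0;  /* 48-bit and longer formats */
    }

    int main(void)
    {
        printf("zArch LR R1,R2 (0x1812): %u bytes\n", zarch_length_bytes(0x1812));
        printf("RISC-V addi (low parcel 0x0013): %u bytes\n", riscv_length_bytes(0x0013));
        return 0;
    }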

For x86, one strategy AMD has used to avoid excessive decode energy consumption is to store marker bits in the instruction cache indicating which byte ends an instruction. With such marker bits, it is simple to find the start of each instruction. This technique has the disadvantage that it adds to the latency of an instruction cache miss (the instructions must be predecoded), and it still requires the decoders to check that the lengths are correct (e.g., in case a jump is made into what was previously the middle of an instruction).
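
A minimal sketch of the marker-bit idea (the structure and field names are invented; real predecode information is more elaborate):

    /* End-of-instruction marker bits stored alongside cached bytes; set at
       predecode time when the line is filled. Names are made up. */
    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define LINE 16

    typedef struct {
        uint8_t bytes[LINE];
        bool    ends_insn[LINE];  /* true where an instruction's last byte sits */
    } icache_line;

    /* With the markers, each instruction start is the byte after a marked byte. */
    static void list_instructions(const icache_line *l)
    {
        unsigned start = 0;
        for (unsigned i = 0; i < LINE; i++) {
            if (l->ends_insn[i]) {
                printf("instruction at offset %u, %u byte(s)\n", start, i - start + 1);
                start = i + 1;
            }
        }
    }

    int main(void)
    {
        /* nop; xor eax,eax; ret */
        icache_line l = { { 0x90, 0x31, 0xC0, 0xC3 }, { false } };
        l.ends_insn[0] = true;   /* nop ends at byte 0 */
        l.ends_insn[2] = true;   /* xor eax,eax ends at byte 2 */
        l.ends_insn[3] = true;   /* ret ends at byte 3 */
        list_instructions(&l);
        return 0;
    }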

Intel seems to prefer the speculative parallel decode approach. Since the length of a previous instruction in a chunk to be decoded will be available after only a modest delay, the second and later decoders may not need to fully decode the instruction at every possible starting point.
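
A toy picture of that approach (purely illustrative: pretend a length decoder ran at every byte offset in parallel, then only the chain selected by the confirmed lengths is kept):

    /* Speculative parallel decode, schematically: lengths are "decoded" at all
       offsets, then the real instruction boundaries are chained from offset 0
       and the other speculative results are dropped. */
    #include <stdio.h>

    #define CHUNK 16

    int main(void)
    {
        /* Hypothetical lengths a decoder would report at each starting offset. */
        unsigned spec_len[CHUNK] = {1,2,3,1,5,2,1,1, 2,3,1,1,4,1,2,1};
        unsigned pos = 0;
        while (pos < CHUNK) {
            printf("confirmed instruction at offset %u, length %u\n",
                   pos, spec_len[pos]);
            pos += spec_len[pos];
        }
        return 0;
    }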

Since x86 instructions can be relatively complex, there are also often decode template constraints, and at least one earlier design restricted the number of prefixes that could be used while maintaining full decode bandwidth. E.g., Haswell limits the second through fourth instructions decoded to producing only one µop, while the first instruction can decode into up to four µops (with longer µop sequences using a microcode engine). Basically, this is an optimization for the common case (relatively simple instructions) at the expense of the less common case.
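
A rough model of such a decode template (using the 4-1-1-1 split described above; the instruction mix is made up):

    /* Decode-group formation under a 4-1-1-1 template: only the first decoder
       in a group may emit more than one µop (up to four; longer sequences would
       go to microcode and are not modeled). Instruction mix is invented. */
    #include <stdio.h>

    int main(void)
    {
        unsigned uops[] = {1, 1, 3, 1, 1, 1};  /* µops each instruction needs */
        unsigned n = sizeof uops / sizeof uops[0];
        unsigned i = 0, cycle = 0;
        while (i < n) {
            unsigned slot = 0;
            while (slot < 4 && i < n) {
                if (slot > 0 && uops[i] > 1)
                    break;  /* complex instruction must wait to lead a group */
                printf("cycle %u, decoder %u: instruction %u (%u uop%s)\n",
                       cycle, slot, i, uops[i], uops[i] == 1 ? "" : "s");
                slot++;
                i++;
            }
            cycle++;
        }
        return 0;
    }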

In more recent performance-oriented x86 designs, Intel has used a µop cache, which stores instructions in decoded format, avoiding the template and fetch-width constraints and reducing the energy used for decoding.
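
A minimal sketch of the idea (the organization, sizes, and names below are invented; real µop caches are set-associative and track more state):

    /* Look up already-decoded µops by fetch-block address; on a hit the legacy
       decoders are bypassed. Direct-mapped and greatly simplified. */
    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define UOP_SETS    32
    #define BLOCK_SHIFT 5   /* 32-byte fetch blocks */

    typedef struct {
        bool     valid;
        uint64_t tag;        /* address of the decoded fetch block */
        unsigned uop_count;  /* number of µops stored for that block */
    } uop_cache_entry;

    static uop_cache_entry uop_cache[UOP_SETS];

    static bool uop_cache_lookup(uint64_t block_addr, unsigned *uops_out)
    {
        uop_cache_entry *e = &uop_cache[(block_addr >> BLOCK_SHIFT) % UOP_SETS];
        if (e->valid && e->tag == block_addr) {
            *uops_out = e->uop_count;  /* hit: issue µops directly */
            return true;
        }
        return false;                  /* miss: fall back to the decoders */
    }

    int main(void)
    {
        uop_cache[1] = (uop_cache_entry){ true, 0x20, 6 };
        unsigned uops = 0;
        printf("hit=%d uops=%u\n", uop_cache_lookup(0x20, &uops), uops);
        return 0;
    }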

