With variable-length instructions, how does the computer know the length of the instruction being fetched?

Question

In architectures where not all instructions are the same length, how does the computer know how much to read for one instruction? For example, in Intel IA-32 some instructions are 4 bytes and some are 8 bytes, so how does it know whether to read 4 or 8 bytes? Is it that the first instruction read when the machine is powered on has a known size, and each instruction contains the size of the next one?

Recommended answer

First, the processor does not need to know how many bytes to fetch; it can fetch a convenient number of bytes, sufficient to provide the targeted throughput for typical or average instruction lengths. Any extra bytes can be placed in a buffer to be used in the next group of bytes to be decoded. There are tradeoffs in the width and alignment of fetch relative to the supported width of instruction decode, and even with respect to the width of later parts of the pipeline. Fetching more bytes than average can reduce the impact of variability in instruction length and of the effective fetch bandwidth lost to taken control flow instructions.
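The buffering idea can be illustrated with a small software model. This is only a sketch under stated assumptions: `toy_length` is a hypothetical encoding (the low two bits of the first byte, plus one, give the length in bytes) and is not any real ISA, and `fetch_and_split` is an illustrative name. The fetch side always reads a fixed-width chunk without knowing instruction lengths; leftover bytes stay in the buffer for the next round.

```python
def toy_length(first_byte):
    """Hypothetical toy encoding: low two bits + 1 = length in bytes (1 to 4)."""
    return (first_byte & 0b11) + 1


def fetch_and_split(memory, fetch_width=4):
    """Fetch fixed-width chunks and peel whole instructions off a buffer."""
    buffer = bytearray()
    fetched = 0          # number of bytes requested from memory so far
    instructions = []
    while True:
        if buffer and len(buffer) >= toy_length(buffer[0]):
            # A complete instruction is buffered: hand it to "decode".
            n = toy_length(buffer[0])
            instructions.append(bytes(buffer[:n]))
            del buffer[:n]
        elif fetched < len(memory):
            # Not enough bytes buffered: fetch the next fixed-width chunk.
            buffer += memory[fetched:fetched + fetch_width]
            fetched += fetch_width
        else:
            break        # out of code (a truncated tail, if any, is dropped)
    return instructions


code = bytes([0x01, 0xAA, 0x00, 0x02, 0xBB, 0xCC])
print(fetch_and_split(code))
```

In this example the last instruction straddles the boundary between the first and second 4-byte fetch, and the buffer is what lets it be reassembled.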

(Taken control flow instructions may introduce a fetch bubble if the [predicted] target is not available until a cycle after the next fetch and reduce effective fetch bandwidth with targets that are less aligned than the instruction fetch. E.g., if instruction fetch is 16-byte aligned—as is common for high performance x86—a taken branch that targets the 16th [last] byte in a chunk will result in effectively only one byte of code being fetched as the other 15 bytes are discarded.)
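The alignment effect in the example above is simple arithmetic. A minimal sketch, assuming an aligned fetch block of `fetch_width` bytes (the function name is illustrative):

```python
def useful_fetch_bytes(target_addr, fetch_width=16):
    """Bytes of an aligned fetch block that lie at or after a branch target."""
    offset = target_addr % fetch_width   # position of the target in its block
    return fetch_width - offset          # bytes before the target are discarded


print(useful_fetch_bytes(0x1000))  # target at the start of a block
print(useful_fetch_bytes(0x100F))  # target at the last byte of a block
```

A target on the first byte of a block yields the full 16 bytes, while a target on the 16th byte yields a single useful byte.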

Even for fixed-length instructions, fetching multiple instructions per cycle introduces similar issues. Some implementations (e.g., MIPS R10000) would fetch as many instructions as could be decoded even if they were not aligned, as long as the group of instructions did not cross a cache line boundary. (I seem to recall that one RISC implementation used two banks of Icache tags to allow fetch to cross a cache block boundary, though not a page boundary.) Other implementations (e.g., POWER4) would fetch aligned chunks of code even for a branch targeting the last instruction in such a chunk. (For POWER4, 32-byte chunks containing 8 instructions were used, but at most five instructions could pass decode per cycle. This excess fetch width could be exploited to save energy via cycles where no fetch is performed, and to give spare Icache cycles for cache block filling after a miss while having only one read/write port to the Icache.)

For decoding multiple instructions per cycle, there are effectively two strategies: speculatively decode in parallel, or wait for the lengths to be determined and use that information to parse the instruction stream into separate instructions. For an ISA like IBM's z/Architecture (an S/360 descendant), the length in 16-bit parcels is trivially determined by two bits in the first parcel, so waiting for the lengths to be determined makes more sense. (RISC-V's slightly more complex length indication mechanism would still be friendly to non-speculative decode.) For an encoding like that of microMIPS or Thumb-2, which have only two lengths determinable by the major opcode and for which the encodings of different-length instructions are substantially different, non-speculative decode may be preferred, especially given the likely narrow decode and emphasis on energy efficiency, though with only two lengths some speculation may be reasonable at small decode widths.
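To make the "length from the first few bits" point concrete, here is a rough sketch of length determination for the two encodings mentioned. The z/Architecture rule uses the top two bits of the first opcode byte (00 gives 2 bytes, 01 or 10 gives 4, 11 gives 6). The RISC-V rule shown covers only the standard 16-bit and 32-bit encodings and deliberately refuses the longer reserved formats. The function names are illustrative, not taken from any toolchain.

```python
def zarch_length(first_byte):
    """z/Architecture: top two bits of the first byte select 2, 4, or 6 bytes."""
    return (2, 4, 4, 6)[(first_byte >> 6) & 0b11]


def riscv_length(low_byte):
    """RISC-V: inspect the lowest-addressed byte of the instruction."""
    if low_byte & 0b11 != 0b11:
        return 2                      # compressed (16-bit) instruction
    if (low_byte >> 2) & 0b111 != 0b111:
        return 4                      # standard 32-bit instruction
    raise ValueError("longer encodings are not handled in this sketch")


def split_zarch(code):
    """Non-speculative parse: each length is known before moving on."""
    lengths, pos = [], 0
    while pos < len(code):
        n = zarch_length(code[pos])
        lengths.append(n)
        pos += n
    return lengths


# BCR (0x07), L (0x58), and MVC (0xD2) opcodes followed by operand bytes.
print(split_zarch(bytes.fromhex("07fe" "58102000" "d20710002000")))
```

Because each length depends on only a couple of bits, finding the next instruction boundary is cheap enough that waiting for it costs little.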

For x86, one strategy used by AMD to avoid excessive decode energy use is to use marker bits in the instruction cache indicating which byte ends an instruction. With such marker bits, it is simple to find the start of each instruction. This technique has the disadvantage that it adds to the latency of an instruction cache miss (the instructions must be predecoded) and it still requires the decoders to check that the lengths are correct (e.g., in case a jump is made into what was previously the middle of an instruction).
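The marker-bit idea can be sketched in software as follows. This is not AMD's actual predecode logic: `toy_length` is a hypothetical encoding (low two bits of the first byte, plus one) standing in for the much more involved x86 length rules, and both function names are illustrative. Predecode runs once, sequentially, when a line is filled; afterwards instruction starts fall out of the stored bits.

```python
def toy_length(first_byte):
    """Hypothetical toy encoding: low two bits + 1 = length in bytes."""
    return (first_byte & 0b11) + 1


def predecode_end_markers(line):
    """Done on a cache fill: mark the last byte of every instruction."""
    markers = [False] * len(line)
    pos = 0
    while pos < len(line):
        n = toy_length(line[pos])
        if pos + n > len(line):
            break                     # instruction spills past this line
        markers[pos + n - 1] = True
        pos += n
    return markers


def starts_from_markers(markers, entry=0):
    """Done on every fetch: a start follows each marked end byte."""
    starts = [entry]
    for i in range(entry, len(markers) - 1):
        if markers[i]:
            starts.append(i + 1)
    return starts


line = bytes([0x01, 0xAA, 0x00, 0x02, 0xBB, 0xCC])
markers = predecode_end_markers(line)
print(starts_from_markers(markers))
```

Note that `starts_from_markers` trusts the markers. If `entry` lands in the middle of what predecode treated as one instruction, the result is wrong, which is why the decoders still have to verify lengths.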

Intel seems to prefer the speculative parallel decode approach. Since the length of a previous instruction in a chunk to be decoded will be available after only modest delay, the second and later decoders may not need to fully decode the instruction for all starting points.
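A software caricature of speculative parallel decode, for illustration only: compute a candidate length at every byte offset as if an instruction started there (in hardware these would be evaluated in parallel), then select the real starts by chaining from the known first start. `toy_length` is again a hypothetical encoding (low two bits of the first byte, plus one), not x86, and the function names are illustrative.

```python
def toy_length(first_byte):
    """Hypothetical toy encoding: low two bits + 1 = length in bytes."""
    return (first_byte & 0b11) + 1


def speculative_lengths(chunk):
    """Candidate length at every offset, most of which will be discarded."""
    return [toy_length(b) for b in chunk]


def select_starts(lengths, first_start=0):
    """Follow the chain of lengths to pick the offsets that are real starts."""
    starts, pos = [], first_start
    while pos < len(lengths):
        starts.append(pos)
        pos += lengths[pos]
    return starts


chunk = bytes([0x01, 0xAA, 0x00, 0x02, 0xBB, 0xCC])
lengths = speculative_lengths(chunk)
print(lengths)
print(select_starts(lengths))
```

The work done at offsets 1, 4, and 5 is wasted here, which is the energy cost of speculation. The selection step is the short serial chain that the answer says becomes available "after only modest delay".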

Since x86 instructions can be relatively complex, there are also often decode template constraints and at least one earlier design restricted the number of prefixes that could be used while maintaining full decode bandwidth. E.g., Haswell limits the second through fourth instructions decoded to producing only one µop while the first instruction can decode into up to four µops (with longer µop sequences using a microcode engine). Basically, this is an optimization for the common case (relatively simple instructions) at the expense of the less common case.
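The effect of such a decode template can be sketched with a simplified model. It assumes one complex decoder slot handling up to four µops, followed by three simple slots that each accept only single-µop instructions, and it further assumes (this is a guess for illustration, not a documented behavior) that an instruction needing the microcode engine occupies a decode cycle by itself. The function name and parameters are hypothetical.

```python
def group_decode_cycles(uop_counts, simple_slots=3, complex_max=4):
    """Group instructions (given as µop counts, in program order) into cycles."""
    cycles = []
    i = 0
    while i < len(uop_counts):
        group = [uop_counts[i]]       # the first slot accepts any instruction
        i += 1
        if group[0] <= complex_max:   # assumption: microcoded ops decode alone
            while (i < len(uop_counts)
                   and len(group) < 1 + simple_slots
                   and uop_counts[i] == 1):
                group.append(1)       # simple slots take only 1-µop instructions
                i += 1
        cycles.append(group)
    return cycles


print(group_decode_cycles([1, 1, 1, 1, 1]))   # simple code fills all four slots
print(group_decode_cycles([1, 2, 1, 1]))      # a 2-µop instruction must wait for slot one
```

Streams of simple instructions decode four per cycle in this model, while a multi-µop instruction that is not first in line ends the current group early.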

In more recent performance-oriented x86 designs, Intel has used a µop cache, which stores instructions in decoded format, avoiding template and fetch width constraints and reducing the energy use associated with decoding.
