x86 registers: MBR/MDR and instruction registers

Question

From what I have read, the IA-32 architecture has ten 32-bit and six 16-bit registers.

The 32-bit registers are:

  • Data registers - EAX, EBX, ECX, EDX
  • Pointer registers - EIP, ESP, EBP
  • Index registers - ESI, EDI
  • Control registers - EFLAGS (EIP is also classified as a control register)

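The 16-bit registers are:

  • Code segment (CS): contains all the instructions to be executed.
  • Data segment (DS): contains data, constants, and work areas.
  • Stack segment (SS): contains data and return addresses of procedures or subroutines.
  • Extra segment (ES): pointer to extra data.
  • F segment (FS): pointer to more extra data.
  • G segment (GS): pointer to more extra data.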

However, I can't find any information on the Current Instruction Register (CIR) or the Memory Buffer Register (MBR) / Memory Data Register (MDR). Are these registers referred to as something else? And are they 32-bit?

I assume they are 32-bit, and that the most commonly used instructions in this architecture are at most 4 bytes long. From observation, many instructions seem to be 4 bytes or shorter, for example:

  • PUSH EBP (55)
  • MOV EBP, ESP (8B EC)
  • LEA EAX, [EAX+EDI+2] (8D 44 38 02)
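
The lengths of these three examples follow from the opcode byte plus the ModRM/SIB bytes. A minimal Python sketch, assuming 32-bit mode, no prefixes, and no immediates, and covering only these opcodes (insn_length is a hypothetical helper, not a real disassembler API):

```python
# Toy length decoder for a handful of 32-bit-mode x86 opcodes.
# Assumption: no prefixes and no immediate operands; real decoders cover far more.

NO_MODRM = {0x55}          # PUSH EBP: register is encoded in the opcode byte itself
WITH_MODRM = {0x8B, 0x8D}  # MOV r32, r/m32 and LEA r32, m: opcode + ModRM


def insn_length(code: bytes) -> int:
    op = code[0]
    if op in NO_MODRM:
        return 1
    if op not in WITH_MODRM:
        raise ValueError(f"opcode {op:#04x} is not covered by this sketch")
    modrm = code[1]
    mod, rm = modrm >> 6, modrm & 0b111
    length = 2  # opcode + ModRM
    if mod == 0b11:
        return length  # register-to-register form, nothing follows
    if rm == 0b100:
        length += 1  # a SIB byte follows ModRM
        if mod == 0b00 and (code[2] & 0b111) == 0b101:
            length += 4  # SIB with no base register carries a disp32
    elif mod == 0b00 and rm == 0b101:
        length += 4  # absolute [disp32] addressing
    if mod == 0b01:
        length += 1  # disp8
    elif mod == 0b10:
        length += 4  # disp32
    return length


for hex_bytes in ("55", "8BEC", "8D443802"):
    print(hex_bytes, insn_length(bytes.fromhex(hex_bytes)))
```

This prints lengths 1, 2 and 4. Note that the decoder never needs more than the opcode, ModRM and (if present) SIB byte to know where the instruction ends.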

For longer instructions, the CPU will use prefix codes and other optional codes. Longer instructions will require more than one cycle to complete, depending on the instruction length.

Am I correct in that the registers in question are 32-bit in length? And are there any other registers in the IA-32 architecture that I should also be aware of?

Recommended answer

No, the registers you're talking about are implementation details that don't exist as physical registers in modern x86 CPUs.

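x86 doesn't specify any of the implementation details you'd find in a toy/teaching CPU design. The x86 manuals only specify what's architecturally visible.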

Intel and AMD's optimization manuals go into some detail about the internal implementation, and it's nothing like what you're suggesting. Modern x86 CPUs rename the architectural registers onto much larger physical register files, enabling out-of-order execution without stalling for write-after-write or write-after-read data hazards. (See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for more details about register renaming). See this answer for a basic intro to out-of-order exec, and a block diagram of an actual Haswell core. (And remember that a physical chip has multiple cores).
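
To make the renaming idea concrete, here is a toy register alias table in Python. It is an illustrative sketch under simplifying assumptions (a hypothetical Renamer class, and a free list that is never refilled because retirement isn't modeled), not how any particular Intel or AMD core implements it:

```python
# Toy register renaming: each write to an architectural register gets a fresh
# physical register, so reusing EAX doesn't create WAW or WAR hazards.
# Assumption: physical registers are never freed (retirement isn't modeled).

class Renamer:
    def __init__(self, arch_regs, num_phys):
        # Register alias table: architectural name -> physical register number
        self.rat = {reg: i for i, reg in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_phys))

    def rename(self, dst, srcs):
        # Look up sources first, so "add eax, eax" reads the old EAX mapping
        phys_srcs = [self.rat[s] for s in srcs]
        phys_dst = self.free.pop(0)
        self.rat[dst] = phys_dst
        return phys_dst, phys_srcs


ren = Renamer(["EAX", "EBX", "ECX"], num_phys=16)
i1 = ren.rename("EAX", [])              # mov eax, [mem1]
i2 = ren.rename("EBX", ["EBX", "EAX"])  # add ebx, eax
i3 = ren.rename("EAX", [])              # mov eax, [mem2]
i4 = ren.rename("ECX", ["ECX", "EAX"])  # add ecx, eax
print(i1, i2, i3, i4)  # (3, []) (4, [1, 3]) (5, []) (6, [2, 5])
```

The second write to EAX lands in physical register 5, so it can execute without waiting for the first add, which still reads physical register 3.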

Unlike a simple or toy microarchitecture, almost all high-performance CPUs support miss under miss and/or hit under miss (multiple outstanding cache misses, instead of memory operations blocking completely while waiting for the first miss to complete).

You could build a simple x86 that had a single MBR / MDR; I wouldn't be surprised if original 8086 and maybe 386 microarchitectures had something like that as part of the internal implementation.

But for example a Haswell or Skylake core can do 2 loads and 1 store per cycle from/to L1d cache (See How can cache be that fast?). Obviously they can't have just one MBR. Instead, Haswell has 72 load-buffer entries and 42 store-buffer entries, which all together are part of the Memory Order Buffer which supports out-of-order execution of loads / stores while maintaining the illusion that only StoreLoad reordering happens / is visible to other cores.
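
The StoreLoad reordering that a store buffer allows can be shown with a toy model in Python. It is a sketch under stated assumptions (a hypothetical Core class with one FIFO store buffer in front of a shared memory dict; no caches, load buffer, or fences modeled) running the classic store-buffer litmus test:

```python
class Core:
    """Toy core: a FIFO store buffer in front of memory shared with other cores."""

    def __init__(self, memory):
        self.memory = memory
        self.store_buffer = []  # (addr, value) pairs, oldest first

    def store(self, addr, value):
        # Buffered locally; not yet visible to other cores
        self.store_buffer.append((addr, value))

    def load(self, addr):
        # Store-to-load forwarding: this core sees its own youngest matching store
        for a, v in reversed(self.store_buffer):
            if a == addr:
                return v
        return self.memory.get(addr, 0)

    def drain(self):
        # Commit to shared memory in program order (x86 keeps normal stores ordered)
        for a, v in self.store_buffer:
            self.memory[a] = v
        self.store_buffer.clear()


memory = {}
c0, c1 = Core(memory), Core(memory)
c0.store("x", 1)
c1.store("y", 1)
own = c0.load("x")  # 1, forwarded from c0's own store buffer
r0 = c0.load("y")   # 0: c1's store is still sitting in c1's buffer
r1 = c1.load("x")   # 0: likewise for c0's store
c0.drain()
c1.drain()
print(own, r0, r1, memory)  # 1 0 0 {'x': 1, 'y': 1}
```

Both loads returning 0 is exactly the StoreLoad reordering that x86 permits: each load executed before the older store from the other core became globally visible.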

Since the P5 Pentium, naturally aligned loads/stores up to 64 bits are guaranteed atomic, but before that only aligned accesses up to 32 bits were. So yes, if the 386/486 had an MDR, it could have been 32 bits. But even those early CPUs could have cache between the CPU and RAM.
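
The alignment rule above can be written as a simple check. A minimal Python sketch (guaranteed_atomic is a hypothetical helper); it encodes only the naturally-aligned rule stated here and ignores the extra guarantees the Intel SDM gives P6 and later CPUs for some misaligned cached accesses that stay within one cache line:

```python
def guaranteed_atomic(addr: int, size: int, p5_or_later: bool = True) -> bool:
    """Naturally aligned access of up to 8 bytes (P5+) or 4 bytes (earlier)."""
    if size not in (1, 2, 4, 8):
        return False
    max_size = 8 if p5_or_later else 4
    return size <= max_size and addr % size == 0


print(guaranteed_atomic(0x1000, 8))                     # True
print(guaranteed_atomic(0x1004, 8))                     # False: not 8-byte aligned
print(guaranteed_atomic(0x1000, 8, p5_or_later=False))  # False: pre-P5, max 4 bytes
```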

We know that Haswell and later have a 256-bit path between L1d cache and execution units, i.e. 32 bytes, and Skylake-AVX512 has 64-byte paths for ZMM loads/stores. AMD CPUs split wide vector ops into 128-bit chunks, so their load/store buffer entries are presumably only 16 bytes wide.
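
As rough arithmetic, the number of datapath-width pieces a vector access needs is a ceiling division. A Python sketch; the widths are the ones quoted in this answer, and treating each piece as one transfer or one buffer entry is a simplification:

```python
def pieces(vector_bits: int, path_bits: int) -> int:
    """How many path-width chunks a vector of vector_bits needs (ceiling division)."""
    return -(-vector_bits // path_bits)


# (description, vector width, datapath / chunk width), per the figures above
cases = [
    ("YMM load, Haswell 256-bit path", 256, 256),
    ("ZMM load, Skylake-AVX512 512-bit path", 512, 512),
    ("YMM load, AMD 128-bit chunks", 256, 128),
]
for name, vec, path in cases:
    print(f"{name}: {pieces(vec, path)}")
```

So a 256-bit access fits in one piece on Haswell but needs two 16-byte pieces on a CPU that splits vectors into 128-bit halves.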

Intel CPUs at least merge adjacent stores to the same cache line within the store buffer, and there are also the 10 LFBs (line-fill buffers) for pending transfers between L1d and L2 (or off-core to L3 or DRAM).
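
Merging adjacent stores to the same cache line can be sketched as grouping consecutive store-buffer entries by line number. A simplified Python model, not a description of Intel's actual merging rules: coalesce is a hypothetical helper, lines are assumed to be 64 bytes, and only consecutive stores merge so that commit order is preserved:

```python
LINE_BYTES = 64  # assumption: 64-byte cache lines, as on current x86 CPUs


def coalesce(stores):
    """Group runs of consecutive (addr, size) stores that hit the same cache line."""
    commits = []  # list of (line_number, [stores merged into one commit])
    for addr, size in stores:
        line = addr // LINE_BYTES
        if (addr + size - 1) // LINE_BYTES != line:
            raise ValueError("line-split stores are not handled in this sketch")
        if commits and commits[-1][0] == line:
            commits[-1][1].append((addr, size))  # same line as previous store: merge
        else:
            commits.append((line, [(addr, size)]))
    return commits


stores = [(0x1000, 8), (0x1008, 8), (0x1040, 4), (0x1010, 8)]
for line, merged in coalesce(stores):
    print(hex(line * LINE_BYTES), merged)
```

Four stores become three commits here: the first two share a line, but the last one can't merge backwards past the store to the next line without reordering stores.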

x86 is a variable-length instruction set; even without counting prefixes, the longest instructions are longer than 32 bits. This was true even for 8086. For example, add word [bx+disp16], imm16 is 6 bytes long. But the 8088 only had a 4-byte prefetch queue to decode from (vs. the 8086's 6-byte queue), so it had to support decoding instructions without having loaded the whole thing from memory. The 8088/8086 decoded prefixes 1 cycle at a time, and the opcode + ModRM bytes (which fit easily in 4 bytes) are definitely enough to identify the length of the rest of the instruction, so it could decode it and then fetch the disp16 and/or imm16 if they hadn't been fetched yet. Modern x86 can have much longer instructions (up to the architectural limit of 15 bytes), especially with SSSE3 / SSE4 requiring mandatory prefixes and multi-byte escape sequences as part of the opcode.
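
The point about the 8088 can be illustrated by computing the full length of an ADD r/m16, imm16 (opcode 81 /0) from just the bytes a 4-byte prefetch queue would hold. A Python sketch covering 16-bit addressing only; add_imm16_length is a hypothetical helper and the displacement/immediate values are made up for the example:

```python
def disp_size_16(modrm: int) -> int:
    """Displacement bytes implied by a 16-bit-addressing ModRM byte."""
    mod, rm = modrm >> 6, modrm & 0b111
    if mod == 0b00:
        return 2 if rm == 0b110 else 0  # mod=00, rm=110 means a direct [disp16]
    if mod == 0b01:
        return 1  # disp8
    if mod == 0b10:
        return 2  # disp16
    return 0  # mod=11: register operand, no memory address


def add_imm16_length(queue: bytes) -> int:
    """Total length of 81 /0 iw (ADD r/m16, imm16), using only opcode and ModRM."""
    if queue[0] != 0x81 or ((queue[1] >> 3) & 0b111) != 0:
        raise ValueError("not ADD r/m16, imm16 in this sketch")
    return 1 + 1 + disp_size_16(queue[1]) + 2  # opcode + ModRM + disp + imm16


# add word [bx+0x1234], 0x5678 encodes as 81 87 34 12 78 56 (6 bytes),
# but an 8088-style 4-byte queue only holds the first four of them.
print(add_imm16_length(bytes.fromhex("81873412")))  # 6
```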

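It's also a CISC ISA, so keeping the actual instruction bytes around internally isn't very useful; you can't use the instruction bits directly as internal control signals the way you could with a simple MIPS.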
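
For contrast with x86, a fixed-format RISC encoding puts every field at a fixed bit position, so hardware can route those bits straight to the register file and ALU control. A Python sketch decoding a MIPS R-type word (the example word encodes add $t0, $t1, $t2):

```python
def decode_mips_r_type(word: int) -> dict:
    """Split a 32-bit MIPS R-type instruction into its fixed bit fields."""
    return {
        "opcode": (word >> 26) & 0x3F,  # bits 31..26
        "rs": (word >> 21) & 0x1F,      # bits 25..21: first source register
        "rt": (word >> 16) & 0x1F,      # bits 20..16: second source register
        "rd": (word >> 11) & 0x1F,      # bits 15..11: destination register
        "shamt": (word >> 6) & 0x1F,    # bits 10..6: shift amount
        "funct": word & 0x3F,           # bits 5..0: selects the ALU operation
    }


print(decode_mips_r_type(0x012A4020))
# {'opcode': 0, 'rs': 9, 'rt': 10, 'rd': 8, 'shamt': 0, 'funct': 32}
```

An x86 decoder can't do this: it first has to work out where the ModRM, displacement and immediate fields even are, as in the length examples earlier.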

In a non-pipelined CPU, yes there might be a single physical EIP register somewhere. For modern CPUs, each instruction has an EIP associated with it, but many are in flight at once inside the CPU. An in-order pipelined CPU might associate an EIP with each stage, but an out-of-order CPU would have to track it on a per-instruction basis. (Actually per uop, because complex instructions decode to more than 1 internal uop.)
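
Per-uop EIP tracking can be sketched as a reorder buffer whose entries each carry the EIP of the instruction they came from: uops may finish in any order but retire oldest-first. Python, with a hypothetical ReorderBuffer class and a made-up two-uop split for the first instruction:

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: uops complete out of order but retire strictly in program order."""

    def __init__(self):
        self.entries = deque()  # oldest uop at the left

    def allocate(self, eip: int, uop: str) -> dict:
        entry = {"eip": eip, "uop": uop, "done": False}
        self.entries.append(entry)
        return entry

    def retire(self) -> list:
        # Retire from the head while the oldest uop has finished executing
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft())
        return retired


rob = ReorderBuffer()
load = rob.allocate(0x1000, "load")  # hypothetical 2-uop instruction at 0x1000
add = rob.allocate(0x1000, "add")
inc = rob.allocate(0x1003, "inc")    # 1-uop instruction at 0x1003

inc["done"] = True                   # the younger uop finishes first...
first = rob.retire()                 # ...but nothing retires: the oldest isn't done
load["done"] = add["done"] = True
second = rob.retire()                # now all three retire, oldest first
print([hex(e["eip"]) for e in first], [hex(e["eip"]) for e in second])
# [] ['0x1000', '0x1000', '0x1003']
```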

Modern x86 fetches and decodes in blocks of 16 or 32 bytes, decoding up to 5 or 6 instructions per clock cycle and placing the decode results in a queue for the front-end to issue into the out-of-order part of the core.

See also the CPU-internals links in https://stackoverflow.com/tags/x86/info, especially David Kanter's write-ups and Agner Fog's microarch guides.

BTW, you left out x86's many control and debug registers. The control registers (CR0, CR2, CR3, plus CR4 on later CPUs; CR1 is reserved) are critical on the 386 and later for enabling protected mode, paging, and various other features. You could run a CPU in real mode using only the general-purpose and segment registers plus EFLAGS, but x86 has far more architectural registers if you include the non-general-purpose registers that the OS needs to manage.
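
Two CR0 bits decide whether the CPU is in protected mode and whether paging is on: PE (bit 0) and PG (bit 31). A small Python sketch (describe_cr0 is a hypothetical helper) that interprets a CR0 value; the values below are example numbers, not read from hardware, since MOV from CR0 requires ring 0:

```python
CR0_PE = 1 << 0   # Protection Enable: set = protected mode
CR0_PG = 1 << 31  # Paging: only valid together with PE


def describe_cr0(cr0: int) -> str:
    if not cr0 & CR0_PE:
        return "real mode"
    if cr0 & CR0_PG:
        return "protected mode, paging enabled"
    return "protected mode, paging disabled"


for value in (0x00000010, 0x00000011, 0x80000011):
    print(hex(value), describe_cr0(value))
```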
