How does a CPU perform operations that manipulate data smaller than the word size?


Question

I have read that when a CPU reads from memory, it reads a whole word of memory at once (e.g. 4 bytes or 8 bytes). How can the CPU achieve something like:

 mov     BYTE PTR [rbp-20], al

where it copies only one byte of data from al to the stack (given that the data bus is 64 bits wide)? It would be great if anyone could explain how this is implemented at the hardware level.

Also, as we all know, when a CPU executes a program, it has a program counter or instruction pointer that points to the address of the next instruction, and the control unit fetches that instruction into the memory data register and executes it later. Let's say:

0:  b8 00 00 00 00          mov    eax,0x0

is 5 bytes long (on x86), and

0:  31 c0                   xor    eax,eax

is 2 bytes long, so instructions have varying lengths.

If the control unit wants to fetch these instructions, does it:

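  1. fetch 8 bytes of machine code (which might contain multiple instructions) and then execute only a portion of them?
  2. fetch instructions that are shorter than 8 bytes (still reading 8 bytes from memory, with the other bytes ignored)?
  3. rely on the instructions already being padded (by the compiler or something similar)?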

What about instructions like:

0:  48 b8 5c 8f c2 f5 28    movabs rax,0x28f5c28f5c28f5c
7:  5c 8f 02

which exceed the word size? How are they handled by the CPU?

Recommended answer

x86 is not a word-oriented architecture at all. Instructions are variable-length with no alignment requirement.

"Word size" is not a meaningful term on x86; some people may use it to refer to the register width, but instruction fetch / decode has nothing to do with the integer registers.

In practice, on most modern x86 CPUs, instruction fetch from the L1 instruction cache happens in aligned 16-byte or 32-byte fetch blocks. Later pipeline stages find instruction boundaries and decode up to 5 instructions in parallel (e.g. Skylake). See David Kanter's write-up of Haswell for a block diagram of the front-end showing 16-byte instruction fetch from L1i cache.
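As a rough illustration of why fetch granularity is unrelated to instruction length, the sketch below maps the three instructions from the question onto aligned 16-byte fetch blocks. The 16-byte block size comes from the paragraph above; the helper name `fetch_blocks` and the back-to-back addresses are hypothetical. It models only the address arithmetic, not how a real front-end finds instruction boundaries.

```python
FETCH_BLOCK = 16  # bytes; assumed aligned fetch-block size (16 or 32 on real CPUs)

def fetch_blocks(addr, length, block=FETCH_BLOCK):
    """Return the start addresses of the aligned fetch blocks an instruction touches."""
    first = addr - addr % block                  # block holding the first byte
    last_byte = addr + length - 1
    last = last_byte - last_byte % block         # block holding the last byte
    return list(range(first, last + block, block))

# Lengths taken from the question; the addresses assume the instructions
# are laid out back to back starting at 0 (an illustrative assumption).
layout = [
    ("mov eax,0x0", 0x0, 5),
    ("xor eax,eax", 0x5, 2),
    ("movabs rax,imm64", 0x7, 10),
]

for name, addr, length in layout:
    blocks = [hex(b) for b in fetch_blocks(addr, length)]
    print(f"{name:18} at {addr:#x}, {length:2} bytes -> fetch blocks {blocks}")
```

With this layout the 10-byte `movabs` straddles two fetch blocks, which is the kind of case the later pipeline stages have to handle when locating instruction boundaries.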

But note that modern x86 CPUs also use a decoded-uop cache so they don't have to deal with the hard-to-decode x86 machine code for code that runs very frequently (e.g. inside a loop, even a large loop). Dealing with variable-length unaligned instructions is a significant bottleneck on older CPUs.

See "Can modern x86 hardware not store a single byte to memory?" for more about how the cache absorbs stores to normal memory regions (MTRR and/or PAT set to WB = Write-Back memory type).

The logic that commits stores from the store buffer to L1 data cache on modern Intel CPUs handles any store of any width, as long as it's fully contained within one 64-byte cache line.
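The "fully contained within one 64-byte cache line" condition is plain address arithmetic. A minimal sketch, assuming a hypothetical helper `splits_cache_line` and made-up example addresses:

```python
CACHE_LINE = 64  # bytes; the x86 cache line size stated in this answer

def splits_cache_line(addr, width, line=CACHE_LINE):
    """True if a store of `width` bytes at `addr` touches two cache lines."""
    first_line = addr // line
    last_line = (addr + width - 1) // line
    return first_line != last_line

# A 1-byte store can never split; an 8-byte store splits only when it
# starts within the last 7 bytes of a line.
print(splits_cache_line(0x1000, 1))   # byte store at the start of a line
print(splits_cache_line(0x1038, 8))   # 8 bytes ending exactly at the line end
print(splits_cache_line(0x103C, 8))   # 8 bytes crossing into the next line
```

Only the last case crosses a line boundary; the first two are the kind of store the commit logic described above handles in one step.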

Non-x86 CPUs that are more word-oriented (like ARM) commonly use a read-modify-write of a cache word (4 or 8 bytes) to handle narrow stores. See "Are there any modern CPUs where a cached byte store is actually slower than a word store?" But modern x86 CPUs do spend the transistors to make cached byte stores or unaligned wider stores exactly as efficient as aligned 8-byte stores into cache.
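The read-modify-write approach amounts to masking one byte lane out of the old word and merging in the new byte. The sketch below shows that logic in software, assuming a little-endian 8-byte word; the function name `store_byte_rmw` is hypothetical, and real hardware does this with byte-lane muxes rather than shifts.

```python
def store_byte_rmw(word, byte_offset, value, word_bytes=8):
    """Merge one byte into a little-endian word, as a read-modify-write would."""
    if not 0 <= byte_offset < word_bytes:
        raise ValueError("byte_offset must lie inside the word")
    full_mask = (1 << (8 * word_bytes)) - 1      # all bits of the word
    shift = 8 * byte_offset
    lane_mask = 0xFF << shift                    # the byte lane being replaced
    kept = word & full_mask & ~lane_mask         # "read" and clear the lane
    return kept | ((value & 0xFF) << shift)      # "modify" and hand back for "write"

old = 0x1122334455667788
print(hex(store_byte_rmw(old, 0, 0xAA)))  # replaces the lowest-addressed byte
print(hex(store_byte_rmw(old, 7, 0xAA)))  # replaces the highest-addressed byte
```

All other bytes of the word come back unchanged, which is why this scheme needs the old word contents before it can write.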

Regarding "given the data bus width is like 64 bit wide":

Modern x86 has memory controllers built in to the CPU. The DDR[1234] SDRAM bus has 64 data lines, but a single read or write command initiates a burst of 8 transfers, transferring 64 bytes of data. (Not coincidentally, 64 bytes is the cache line size for all existing x86 CPUs.)

For a store to an uncacheable memory region (i.e. if the CPU is configured to treat that address as uncacheable even though it's backed by DRAM), a single-byte or other narrow store is possible using the DQM byte-mask signals, which tell the DRAM which of the 8 bytes in each transfer of the burst are actually to be stored.
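To make the burst and byte-mask idea concrete, the sketch below models a 64-byte burst as 8 transfers of 8 bytes and computes which byte lanes a narrow store would enable in each transfer. This is a simplified model: the name `burst_byte_enables` is hypothetical, and "1 = write this byte" is an illustrative convention rather than the electrical polarity of the actual mask pins.

```python
BEATS = 8           # transfers per burst, per the text above
BYTES_PER_BEAT = 8  # 64 data lines = 8 bytes per transfer

def burst_byte_enables(offset, width):
    """Per-transfer byte-enable masks for a store inside one 64-byte line.

    Bit i of masks[n] is set when byte i of transfer n should be written.
    """
    line_bytes = BEATS * BYTES_PER_BEAT
    if offset < 0 or width <= 0 or offset + width > line_bytes:
        raise ValueError("store must fit inside one 64-byte line")
    masks = [0] * BEATS
    for byte in range(offset, offset + width):
        masks[byte // BYTES_PER_BEAT] |= 1 << (byte % BYTES_PER_BEAT)
    return masks

# A single-byte store at offset 12 enables one lane in the second transfer only.
print([f"{m:08b}" for m in burst_byte_enables(12, 1)])
```

Every other byte of the burst is masked off, so the DRAM contents around the stored byte stay untouched without a read first.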

(Or if that's not supported (which may be the case), the memory controller may have to read the old contents and merge, then store the whole line. Either way, 4-byte or 8-byte chunks are not the significant unit here. DDR burst transfers can be cut short, but only to 32 bytes down from 64. I don't think an 8-byte aligned write is actually very special at the DRAM level. It is guaranteed to be "atomic" in the x86 ISA, though, even on uncacheable MMIO regions.)

A store to an uncacheable MMIO region will result in a PCIe transaction of the appropriate size, up to 64 bytes.

Inside the CPU core, the bus between data cache and execution units can be 32 or 64 bytes wide (or 16 bytes on current AMD). Transfers of cache lines between L1d and L2 cache are also done over a 64-byte wide bus, on Haswell and later.
