Calculate memory accesses


Problem description

xor dword [0x301a80],0x12345

How many memory accesses are there, given that we know the opcode and addressing mode take 2 bytes?

If I understand correctly, even though it's 0x12345 it's actually still 4 bytes, and we can't pack it in with 0x301a80, right?

So here we have:

2 + 4 + 4 bytes (rather than 2 + 3.5 + 2.5 = 8), i.e. 4 memory accesses.

Is my thinking correct?

Solution


The total instruction size is 10 bytes (in 32-bit mode). Fetching it takes probably 0 to 2 I-cache accesses on a modern x86, which reads code in aligned 16-byte chunks (0 if it hits in the uop cache).

When executed, it does a 4-byte load + a 4-byte store (on an aligned address), which should be a total of 2 data accesses on CPUs other than 386SX (16-bit bus). These can probably hit in cache unless the memory region is uncacheable.
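At the machine level, the memory-destination xor behaves like this explicit load/modify/store sequence (a sketch for illustration only; the real instruction doesn't use an architectural register, and on modern CPUs it decodes into separate load, ALU, and store uops):

mov eax, [0x301a80]    ; 4-byte load from the aligned address
xor eax, 0x12345       ; the modify step
mov [0x301a80], eax    ; 4-byte store back to the same 4 bytes

That's where the 2 data accesses come from: one dword load plus one dword store.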

More loads could be generated by page walks on a TLB miss for that address, if paging is enabled. (And if running inside a VM, both guest and host page tables could be involved with nested page tables. It would be vastly more expensive overall if it #PF page-faulted, but counting the work an OS might do is silly.)

If you're wondering about the total number of bytes touched by an instruction, see Do x86 instructions require their own encoding as well as all of their arguments to be present in memory at the same time? which talks about instruction + data being in memory at once for forward progress to be possible. But it seems you're counting the number of accesses, not the footprint of the bytes accessed. And you haven't said for which microarchitecture: x86 spans a huge range, from the first 32-bit-capable CPU that could run this instruction (386) up to modern x86 with wide pipelines that try to do a lot in parallel.


If you mean "opcode and addressing mode" = opcode + ModRM byte then yes that's 2 bytes. Most people would consider "the addressing mode" to include the 4-byte disp32 as well as the ModRM (that signals which addressing mode is used and the presence of displacement bytes). The immediate is also 4 bytes. So I think your "2+4+4" size calculation is adding up pieces of the total instruction and not counting data accesses. And yes, that 10 bytes total is correct.

Use an assembler to see instruction sizes. e.g. nasm -felf32 -l/dev/stdout foo.asm with a file containing that instruction:

$ cat > foo.asm   # then paste your instruction
xor dword [0x301a80], 0x12345
<control-d for EOF>
$ nasm -felf32 -l/dev/stdout foo.asm
     1 00000000 8135801A3000452301-     xor dword [0x301a80], 0x12345
     1 00000009 00
$ objdump -drwC -Mintel foo.o   # nicer disassembly format, not line-wrapped
...
   0:   81 35 80 1a 30 00 45 23 01 00   xor    DWORD PTR ds:0x301a80,0x12345
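For reference, here's how those 10 bytes break down by field (this is the 81 /6 form, XOR r/m32, imm32):

81            ; opcode: XOR r/m32, imm32 (ModRM.reg field /6 selects XOR)
35            ; ModRM = 00 110 101b: mod=00, reg=110 (/6), rm=101 = [disp32]
80 1a 30 00   ; disp32 = 0x00301a80, little-endian
45 23 01 00   ; imm32  = 0x00012345, little-endian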

  • In 32-bit mode: a 10-byte instruction: opcode + modrm + disp32 + imm32.

  • In 64-bit mode: 11 bytes (+SIB to encode the 32-bit absolute address; the shorter encoding was re-purposed for RIP-relative); see the check after this list.

  • In 16-bit mode: 12 bytes: 66 and 67 operand-size and address-size prefixes in front of the same opcode + modrm + disp32 + imm32 as 32-bit mode.
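You can check the 64-bit size the same way; a sketch of what to expect, assuming NASM's default absolute (not RIP-relative) addressing so the SIB byte appears:

$ nasm -felf64 -l/dev/stdout foo.asm
     1 00000000 813425801A30004523-     xor dword [0x301a80], 0x12345
     1 00000009 0100

The single ModRM byte 35 becomes 34 25: ModRM.rm=100 escapes to a SIB byte, which is the only way to encode [disp32] absolute in 64-bit mode. 11 bytes total.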

x86 machine code can only do imm8 or imm32 for an instruction with 32-bit operand-size. You can see that in the manual for xor specifically. So yes, 0x12345 takes a full 32-bit dword immediate, not 2.5 or 3 bytes. x86 machine code is a byte stream, but there are only a few fixed sizes for the pieces any given instruction is built from. Same deal for the displacement in the addressing mode.
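By contrast, an immediate that does fit in a signed byte would get the shorter 83 /6 form with a sign-extended imm8. A quick check, using a hypothetical small.asm with a smaller immediate, should give something like:

$ nasm -felf32 -l/dev/stdout small.asm
     1 00000000 8335801A300012          xor dword [0x301a80], 0x12

That's 7 bytes: opcode + modrm + disp32 + imm8. 0x12345 doesn't fit in a sign-extended byte, so the original instruction has to carry the full imm32.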


I don't understand how you're getting 4 "accesses" for the 2 + 4 + 4 = 10 byte total size you calculated. If you're just talking about loading the instruction from memory, are you picturing that it's a 1-byte load of the opcode, then 1 byte for the modrm, then 4 bytes each for the disp32 and imm32? Maybe not, since you didn't write it as 1 + 1 + 4 + 4.

In any case, that's not how CPUs work. Old x86 CPUs have a prefetch buffer that they fill with bus-width aligned accesses, then decode from that buffer. They can't just load an unaligned dword from memory with a single access. A 386SX with its 16-bit bus would have taken 5 total accesses to fetch this 10-byte instruction (ceil(10/2) aligned word fetches), or 6 if it started at an odd address.

In modern CPUs with caches, instruction fetch from L1i cache happens in aligned blocks of 16 bytes, I think. (On CPUs since Intel P6: https://agner.org/optimize/) So this instruction might be fetched as part of 1 or 2 I-cache accesses (2 if it spans a 16-byte boundary).

Or it might not need to get fetched at all: the uop cache caches decoded instructions, not x86 machine code, so with a uop-cache hit this instruction can run without any code fetch from memory. (Intel Sandybridge-family and AMD Zen have uop caches; Intel since Core 2 has a loop buffer that can still avoid actual fetch from L1i cache, and skip some or all of the decode work.) https://www.realworldtech.com/sandy-bridge/ has a good deep-dive into SnB-family.


That leaves 2 accesses: dword data load, dword data store. The address is 16-byte aligned so a dword load + store is never going to split into multiple accesses. But it's not an atomic RMW (no lock prefix) so the load and store are separate memory accesses to the same 4 bytes.
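If you did want an atomic RMW, you'd add a lock prefix, one extra byte (F0) in front of the identical encoding; a hypothetical locked.asm should produce something like:

$ nasm -felf32 -l/dev/stdout locked.asm
     1 00000000 F08135801A30004523-     lock xor dword [0x301a80], 0x12345
     1 00000009 0100

With lock, the load and store execute as a single atomic read-modify-write instead of two independent accesses (and are much more expensive if the cache line is contended).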

An aligned dword memory access is guaranteed atomic on x86 since the 486 (Why is integer assignment on a naturally aligned variable atomic on x86?), so any non-ancient CPU will do each of those accesses as a single operation (to cache, or to memory if that's an uncacheable address).

Or this could run on a 386SX, where each dword data access happens as two 16-bit bus operations. 386 chips with the full 32-bit bus also existed; those would do a dword load or store as a single access, like later CPUs.
