How does the CPU decode variable-length instructions correctly?
Question
On most architectures, instructions are all fixed-length. This makes program loading and executing straightforward. On x86/x64, instructions are variable length, so a disassembled program might look like this:
File Type: EXECUTABLE IMAGE
00401000: 8B 04 24 mov eax,dword ptr [esp]
00401003: 83 C4 04 add esp,4
00401006: FF 64 24 FC jmp dword ptr [esp-4]
0040100A: 55 push ebp
0040100B: E8 F0 FF FF FF call 00401000
00401010: 50 push eax
00401011: 68 00 30 40 00 push 403000h
00401016: E8 0D 00 00 00 call 00401028
0040101B: 83 C4 08 add esp,8
0040101E: 33 C0 xor eax,eax
00401020: 5D pop ebp
00401021: 83 C4 04 add esp,4
00401024: FF 64 24 FC jmp dword ptr [esp-4]
00401028: FF 25 00 20 40 00 jmp dword ptr ds:[00402000h]
Summary
1000 .data
1000 .rdata
1000 .reloc
1000 .text
It seems rather difficult to imagine how the CPU "knows" where one instruction ends and the next one begins. For example, if I add the byte 0x90 (NOP) into the middle of the XOR EAX,EAX encoding, the program then disassembles as:
File Type: EXECUTABLE IMAGE
00401000: 8B 04 24 mov eax,dword ptr [esp]
00401003: 83 C4 04 add esp,4
00401006: FF 64 24 FC jmp dword ptr [esp-4]
0040100A: 55 push ebp
0040100B: E8 F0 FF FF FF call 00401000
00401010: 50 push eax
00401011: 68 00 30 40 00 push 403000h
00401016: E8 0D 00 00 00 call 00401028
0040101B: 83 C4 08 add esp,8
0040101E: 33 90 C0 5D 83 C4 xor edx,dword ptr [eax+C4835DC0h]
00401024: 04 FF add al,0FFh
00401026: 64 24 FC and al,0FCh
00401029: FF
0040102A: 25
0040102B: 00 20 add byte ptr [eax],ah
0040102D: 40 inc eax
Summary
1000 .data
1000 .rdata
1000 .reloc
1000 .text
Which, predictably, crashes when run.
I'm curious what exactly the instruction decoder sees with that extra byte that makes it think the line at 0040101E is 6 bytes long, and that the line originally at 00401028 is four separate instructions.
Recommended answer
When fetching an instruction, the CPU first analyses its first byte (the opcode). Sometimes that byte alone is enough to determine the total length of the instruction. Sometimes it tells the CPU to analyse subsequent bytes to determine the length. But all in all, the encoding is not ambiguous.
Yes, the command stream gets screwed up if you insert random bytes in the middle willy-nilly. That's to be expected; not every byte sequence constitutes valid machine code.
Now, about your particular example. The original command was XOR EAX, EAX (33 C0). The encoding of XOR is one of those second-byte-dependent ones. The first byte - 33 - means XOR. The second byte is the ModR/M byte. It encodes the operands - whether it's a register pair, a register and a memory location, etc. The initial value C0 in 32-bit mode corresponds to operands EAX, EAX. The value 90 that you've inserted corresponds to operands EDX, [EAX+offset], and it means that the ModR/M byte is followed by 32 bits of offset. The next four bytes of the command stream are not interpreted as commands anymore - they're the offset in the mangled XOR command.
So by messing with the second byte, you've turned a 2-byte command into a 6-byte one.
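To make the ModR/M point concrete, here is a minimal Python sketch that splits the byte into its three bit fields (mod, reg, r/m) and reports the operands and total length for opcode 33 in 32-bit mode. The function names are made up for illustration, and the sketch deliberately leaves out the SIB and displacement-only addressing forms.

```python
# Illustrative only: handles opcode 33 (XOR r32, r/m32) in 32-bit mode.
# split_modrm and describe_xor_33 are hypothetical helper names.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def split_modrm(modrm):
    """Return the (mod, reg, rm) bit fields of a ModR/M byte."""
    return modrm >> 6, (modrm >> 3) & 7, modrm & 7


def describe_xor_33(modrm):
    """Return (operand text, total instruction length in bytes)."""
    mod, reg, rm = split_modrm(modrm)
    if mod == 3:
        # Register-to-register form: opcode + ModR/M, nothing else.
        return f"xor {REG32[reg]},{REG32[rm]}", 2
    if rm == 4 or (mod == 0 and rm == 5):
        # SIB byte or bare disp32: omitted to keep the sketch short.
        raise NotImplementedError("addressing form not covered by this sketch")
    disp_bytes = {0: 0, 1: 1, 2: 4}[mod]
    suffix = {0: "", 1: "+disp8", 2: "+disp32"}[mod]
    return f"xor {REG32[reg]},[{REG32[rm]}{suffix}]", 2 + disp_bytes


print(describe_xor_33(0xC0))  # ('xor eax,eax', 2)
print(describe_xor_33(0x90))  # ('xor edx,[eax+disp32]', 6)
```

C0 is 11 000 000 in binary (mod=3, register form), while 90 is 10 010 000 (mod=2, a memory operand with a 32-bit displacement), which is where the four extra bytes come from.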
Then the CPU (and the disassembler) resumes fetching after those four. It's in the middle of the ADD ESP, 4 instruction, but the CPU has no way of knowing that. It starts with the 04 byte, the third one in the ADD encoding. The first few bytes at that point still make sense as commands, but since you've ended up in the middle, the original instruction sequence is utterly lost.
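The loss of synchronisation can be reproduced with a toy length decoder. The sketch below is an assumption-laden illustration, not how real hardware is built: it only knows the handful of opcodes that appear in the listings, assumes 32-bit mode, and treats 64 as a one-byte prefix. It walks the bytes starting at 0040101E in both versions of the program and returns the instruction lengths it finds.

```python
# Toy length decoder (hypothetical, 32-bit mode, listing opcodes only).
ONE_BYTE = {0x40, 0x50, 0x55, 0x5D, 0x90}   # inc/push/pop reg, nop
IMM8 = {0x04, 0x24}                         # add al,imm8 / and al,imm8
IMM32 = {0x25, 0x68, 0xE8}                  # and eax,imm32 / push imm32 / call rel32
MODRM = {0x00, 0x33, 0x8B, 0xFF}            # opcode followed by ModR/M
MODRM_IMM8 = {0x83}                         # ModR/M, then an 8-bit immediate
PREFIXES = {0x64}                           # FS segment override


def modrm_extra(code, i):
    """Count SIB and displacement bytes implied by the ModR/M byte at code[i]."""
    mod, rm = code[i] >> 6, code[i] & 7
    if mod == 3:
        return 0                            # register operand, nothing follows
    extra, base = 0, rm
    if rm == 4:                             # SIB byte follows the ModR/M byte
        extra += 1
        base = code[i + 1] & 7
    if mod == 1:
        extra += 1                          # disp8
    elif mod == 2:
        extra += 4                          # disp32
    elif base == 5:
        extra += 4                          # mod == 0 with no base: disp32
    return extra


def insn_length(code, i):
    """Length in bytes of the instruction starting at code[i]."""
    start = i
    while code[i] in PREFIXES:
        i += 1
    op = code[i]
    i += 1
    if op in ONE_BYTE:
        pass
    elif op in IMM8:
        i += 1
    elif op in IMM32:
        i += 4
    elif op in MODRM or op in MODRM_IMM8:
        i += 1 + modrm_extra(code, i)
        if op in MODRM_IMM8:
            i += 1
    else:
        raise ValueError(f"opcode {op:02X} is not covered by this sketch")
    return i - start


def lengths(code):
    """Decode sequentially from offset 0 and collect instruction lengths."""
    out, i = [], 0
    while i < len(code):
        n = insn_length(code, i)
        out.append(n)
        i += n
    return out


original = bytes.fromhex("33 C0 5D 83 C4 04 FF 64 24 FC")
mangled = bytes.fromhex("33 90 C0 5D 83 C4 04 FF 64 24 FC")
print(lengths(original))  # [2, 1, 3, 4]
print(lengths(mangled))   # [6, 2, 3]
```

The second result matches the lengths shown at 0040101E, 00401024 and 00401026 in the mangled listing: one inserted byte changes every instruction boundary that follows it.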