How does an instruction decoder tell the difference between a prefix and a primary opcode?

Problem description

I'm trying to wrap my head around the x86 instruction encoding format. All the sources that I read still make the subject confusing. I'm starting to understand it a little bit but one thing that I'm having trouble with understanding is how the CPU instruction decoder differentiates an opcode prefix from an opcode.

I'm aware that the whole format of the instruction basically depends on the opcode (with extra bit fields defined in the opcode of course). Sometimes the instruction doesn't have a prefix and the opcode is the first byte. How would the decoder know?

I'm assuming that the instruction decoder would be able to tell the difference because opcode bytes and prefix bytes would not share the same binary values. So the decoder can tell if the unique binary number in the byte is an instruction or a prefix. For example (In this example we will stick to single byte opcodes) a REX or LOCK prefix would not share the same byte value as any opcode in the architecture's instruction set.

Solution

Traditional (single-byte) prefixes are different from opcode bytes like you said, so a state machine can just remember which prefixes it's seen until it gets to an opcode byte.
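As a rough illustration of that state machine, here is a minimal Python sketch that collects legacy prefix bytes until it reaches the first byte that is not one. The function name `split_prefixes` is made up for this example, REX prefixes (64-bit mode only) are left out, and real hardware also enforces a 15-byte instruction length limit that is ignored here.

```python
# Legacy single-byte prefixes, as listed in the one-byte x86 opcode map.
LEGACY_PREFIXES = {
    0xF0: "lock",
    0xF2: "repne",
    0xF3: "rep",
    0x2E: "cs", 0x36: "ss", 0x3E: "ds", 0x26: "es",
    0x64: "fs", 0x65: "gs",
    0x66: "operand-size",
    0x67: "address-size",
}

def split_prefixes(code: bytes):
    """Return (prefix names seen, offset of the first non-prefix byte).

    Hypothetical helper: it only remembers which prefixes it saw and
    stops at the first byte that is not in the prefix table.
    """
    seen = []
    pos = 0
    while pos < len(code) and code[pos] in LEGACY_PREFIXES:
        seen.append(LEGACY_PREFIXES[code[pos]])
        pos += 1
    return seen, pos

# f0 01 08 is lock add [eax],ecx in 32-bit mode: one prefix, opcode at offset 1.
print(split_prefixes(bytes([0xF0, 0x01, 0x08])))
# 01 d0 is add eax,edx: no prefixes, opcode at offset 0.
print(split_prefixes(bytes([0x01, 0xD0])))
```

Because no value in the table is also a one-byte opcode, the loop never has to guess: membership in the table is enough to classify the byte.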

The 0f escape byte for 2-byte opcodes is not really a prefix. It has to be contiguous with the 2nd opcode byte. Thus, following a 0f, any byte is an opcode, even if it's something like f2 that would otherwise be a prefix. (This also applies following the 0f 3a or 0f 38 2-byte escapes for SSSE3 and later, or VEX/EVEX prefixes that encode one of those escape sequences.)
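To make the escape-byte rule concrete, here is a small Python sketch that picks the opcode map once prefixes have been consumed. The name `opcode_map` and the map labels are assumptions for illustration, and VEX/EVEX encodings are not handled.

```python
def opcode_map(code: bytes, pos: int = 0):
    """Return (map name, opcode byte), given the offset of the first non-prefix byte.

    Hypothetical helper: after a 0f escape, the next byte is always taken
    as an opcode (or as the second half of a 0f 38 / 0f 3a escape), never
    as a prefix.
    """
    if code[pos] != 0x0F:
        return "one-byte", code[pos]
    nxt = code[pos + 1]
    if nxt in (0x38, 0x3A):
        return f"0f {nxt:02x}", code[pos + 2]
    return "0f", nxt

# 0f f2 /r is pslld: the f2 here is an opcode byte, not a repne prefix.
print(opcode_map(bytes([0x0F, 0xF2, 0xC1])))
# 0f 38 00 /r is pshufb (SSSE3), from the 0f 38 map.
print(opcode_map(bytes([0x0F, 0x38, 0x00, 0xC1])))
```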

If you look at an opcode map, there are no entries that are ambiguous between single-byte prefix and opcode. (e.g. http://ref.x86asm.net/coder64.html, and notice how the 2-byte 0F .. opcodes are listed separately).


The decoders do have to know the current mode for this (and other things); for example, x86-64 removed the 1-byte inc/dec reg opcodes for use as REX prefixes. (See "x86 32 bit opcodes that differ in x86-x64 or entirely removed".) We can even use this difference to write polyglot machine code that runs differently when decoded in 32-bit vs. 64-bit mode, or even distinguish all 3 mode sizes.
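A minimal Python sketch of that mode dependence for the bytes 40..4f follows. The function name `classify_40_4f` is invented for this example, and outside 64-bit mode it assumes the default 32-bit operand size.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def classify_40_4f(byte: int, mode_bits: int) -> str:
    """Describe what a byte in 0x40..0x4f means in the given mode.

    Hypothetical helper: in 64-bit mode the byte is a REX prefix with the
    layout 0100WRXB; otherwise it is a one-byte inc/dec of a register
    (32-bit operand size assumed here).
    """
    if not 0x40 <= byte <= 0x4F:
        raise ValueError("only bytes 0x40..0x4f are handled in this sketch")
    if mode_bits == 64:
        w, r, x, b = (byte >> 3) & 1, (byte >> 2) & 1, (byte >> 1) & 1, byte & 1
        return f"REX prefix (W={w} R={r} X={x} B={b})"
    op = "inc" if byte < 0x48 else "dec"
    return f"{op} {REG32[byte & 7]}"

print(classify_40_4f(0x48, 64))  # REX.W
print(classify_40_4f(0x48, 32))  # dec eax
```

The same byte value gets two unrelated meanings, so the decoder cannot classify it without knowing the mode.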

x86 machine code is a byte stream that's not self-synchronizing (e.g. a ModRM or an immediate can be any byte). The CPU always knows where to start decoding from, either a jump target or the byte after the end of a previous instruction. That's the start of the instruction (including prefixes).

Bytes in memory are just bytes, only becoming instructions when they're decoded by the CPU. (Although in normal programs, simply disassembling from the top of the .text section does give you the program's instructions. Self-modifying and obfuscated code are not normal.)

AVX / AVX-512: multi-byte prefixes that overlap with opcodes

Multi-byte VEX and EVEX prefixes aren't that simple in 32-bit mode. For example VEX prefixes overlap with invalid encodings of LES and LDS in modes other than 64-bit. (The c4 and c5 opcodes for LES and LDS are always invalid in 64-bit mode, except as VEX prefixes.) https://wiki.osdev.org/X86-64_Instruction_Encoding#VEX.2FXOP_opcodes

In legacy / compat modes, there weren't any free bytes left that weren't already opcodes or prefixes when AVX (VEX prefixes) and AVX-512 (EVEX prefix) were introduced, so the only room for extensions was as encodings of opcodes that are only valid with a limited set of ModRM bytes. (e.g. LES / LDS require a memory source, not a register. This is why some bits are inverted in VEX prefixes, so the top 2 bits of the byte after c4 or c5 will always be 1 in 32-bit mode instead of 0. Those bits are the "mod" field in ModRM, and 11 means register.)
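The check described above can be sketched in a few lines of Python. The function name `c4_c5_meaning` is an assumption for illustration, and the sketch assumes protected or long mode on a CPU that supports VEX encodings at all.

```python
def c4_c5_meaning(code: bytes, mode_bits: int) -> str:
    """Decide whether a leading c4 / c5 byte starts a VEX prefix or LES / LDS.

    Hypothetical helper: outside 64-bit mode, the top 2 bits of the next
    byte (the ModRM "mod" field for LES / LDS) decide. mod == 11 would be
    a register source, which LES / LDS do not allow, so that encoding is
    reused for VEX.
    """
    first, second = code[0], code[1]
    if first not in (0xC4, 0xC5):
        raise ValueError("expected a c4 or c5 byte")
    if mode_bits == 64:
        return "VEX prefix"      # LES / LDS do not exist in 64-bit mode
    if second >> 6 == 0b11:
        return "VEX prefix"
    return "les" if first == 0xC4 else "lds"

# c5 f8 77 is vzeroupper: the byte after c5 has its top 2 bits set.
print(c4_c5_meaning(bytes([0xC5, 0xF8, 0x77]), 32))
# c4 06 has mod == 00, a memory operand, so it is LES in 32-bit mode.
print(c4_c5_meaning(bytes([0xC4, 0x06]), 32))
```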

(Fun fact: VEX prefixes are not recognized in 16-bit real mode, apparently because some software used the same invalid encodings of LES / LDS as intentional traps, to be sorted out in the #UD exception handler. VEX prefixes are recognized in 16-bit protected mode, though.)


AMD64 freed up several bytes by removing instructions like AAM, as well as LES/LDS (and the one-byte inc/dec reg encodings for use as REX prefixes), but CPU vendors have continued to care about 32-bit mode and have not added any extensions that are only available in 64-bit mode, which could simply take advantage of those free opcode bytes. This means finding ways to cram new instruction encodings into increasingly small gaps in 32-bit machine code. (Often via mandatory prefixes, e.g. rep bsr decodes as lzcnt on CPUs with that feature, and lzcnt gives different results from bsr.)
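To show why the mandatory-prefix trick matters, here is a Python sketch of how f3 0f bd could be interpreted with and without LZCNT support, along with the 32-bit results of both instructions. All names are invented for this example, and only nonzero inputs are modelled for bsr, since its destination is undefined for a zero source.

```python
def decode_f3_0f_bd(cpu_has_lzcnt: bool) -> str:
    """Name the instruction that f3 0f bd decodes to.

    Hypothetical helper: CPUs without LZCNT ignore the rep prefix and
    run plain bsr, while CPUs with LZCNT treat f3 as mandatory.
    """
    return "lzcnt" if cpu_has_lzcnt else "bsr"

def bsr32(x: int) -> int:
    """Index of the highest set bit of a nonzero 32-bit value."""
    if x == 0:
        raise ValueError("bsr leaves its destination undefined for 0")
    return x.bit_length() - 1

def lzcnt32(x: int) -> int:
    """Number of leading zero bits in a 32-bit value (32 for x == 0)."""
    return 32 - x.bit_length()

for value in (1, 0x80000000):
    print(hex(value), "bsr:", bsr32(value), "lzcnt:", lzcnt32(value))
```

For nonzero 32-bit inputs the two results always sum to 31, so the same bytes silently produce different answers depending on the CPU.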

So the decoders in modern CPUs that support AVX / BMI1/2 have to look at multiple bytes to decide whether c4 / c5 is a prefix for a valid AVX or other VEX-encoded instruction or, in 32-bit mode, whether it should decode as LES or LDS. (And I guess they look at the rest of the instruction to decide if it should #UD.)

But modern CPUs are looking at 16 or 32 bytes at a time anyway to find instruction boundaries in parallel. (And then later feed those groups of instruction bytes to actual decoders, again in parallel.) https://www.realworldtech.com/sandy-bridge/4/

Same goes for the prefix scheme used by AMD XOP, which is a lot like VEX.

Agner Fog's blog article Stop the instruction set war from 2009 (soon after AVX was announced, before the first hardware supporting it) has a table of remaining unused coding space for future extensions, and some notes about it being "assigned" to AMD, Intel, or Via.

Related / examples


Machine code tricks: decoding the same byte multiple ways

(This is not really related to prefixes, but in general, seeing how the rules apply to weird cases can help you understand exactly how things work.)

A software disassembler does need to know a start point. This can be problematic if obfuscated code mixes code and data, and actual execution jumps to places you wouldn't get if you just assume that you can decode in order without following jumps.

Fortunately, compiler-generated code doesn't do that, so naive static disassembly (e.g. by objdump -d or ndisasm, as opposed to IDA) finds the same instruction boundaries that actually running the program will.

This is not a problem for running obfuscated machine code; the CPU just does what it's told, and never cares about bytes before the place you tell it to jump to. Disassembling without running / single-stepping the program is the hard thing, especially with the possibility of self-modifying code and jumps to what a naive disassembler would think was the middle of an earlier instruction.

Obfuscated machine code can even have an instruction decode one way, then jump back into what was the middle of that instruction, for a later byte to be the opcode (or prefix + opcode). Modern CPUs with uop caches or that mark instruction boundaries in I-cache run slow (but correctly) if you do this, so it's more of a fun code-golf trick (extreme code-size optimization at the expense of speed) or obfuscation technique.

For an example of this, see my codegolf.SE x86 machine code answer to Golf a Custom Fibonacci Sequence. I'll excerpt the disassembly that lines up with what the CPU sees after looping back to cfib.loop, but note that the first iteration decodes differently. So I'm using just 1 byte outside the loop instead of 2 to effectively jump into the middle for the start of the first iteration. See the linked answer for a full description and the other disassembly.

0000000000401070 <cfib>:
  401070:       eb                      .byte 0xeb      # jmp rel8 consuming the 01 add opcode as a rel8
0000000000401071 <cfib.loop>:
  401071:       01 d0                   add    eax,edx
# loop entry point on first iteration, jumping over the ModRM byte (D0) of the ADD
    (entry on first iteration):
  401073:       92                      xchg   edx,eax
  401074:       e2 fb                   loop   401071 <cfib.loop>
  401076:       c3                      ret 
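The two decodings can be reproduced with a toy Python decoder that follows control flow through those 7 bytes. This is only a sketch: it knows just the opcodes that appear in this listing, it treats every jmp and loop as taken, and the names `CFIB`, `decode_at`, and `trace` are made up for this example.

```python
BASE = 0x401070
CFIB = bytes.fromhex("eb01d092e2fbc3")   # the 7 bytes of the listing above

def rel8(b: int) -> int:
    """Sign-extend an 8-bit displacement."""
    return b - 256 if b >= 128 else b

def decode_at(addr: int):
    """Return (text, length, branch target or None) for the instruction at addr."""
    i = addr - BASE
    op = CFIB[i]
    if op in (0xEB, 0xE2):                      # jmp rel8 / loop rel8
        target = addr + 2 + rel8(CFIB[i + 1])
        name = "jmp" if op == 0xEB else "loop"
        return f"{name} {target:#x}", 2, target
    if op == 0x01 and CFIB[i + 1] == 0xD0:      # add r/m32,r32 with ModRM d0
        return "add eax,edx", 2, None
    if op == 0x92:
        return "xchg edx,eax", 1, None
    if op == 0xC3:
        return "ret", 1, None
    raise ValueError(f"opcode {op:#x} is not handled in this sketch")

def trace(start: int, steps: int):
    """Follow execution for a few steps, taking every branch."""
    addr, out = start, []
    for _ in range(steps):
        text, length, target = decode_at(addr)
        out.append((hex(addr), text))
        addr = target if target is not None else addr + length
    return out

for addr, text in trace(BASE, 5):
    print(addr, text)
```

In the trace, the 01 byte at 0x401071 is first consumed as the rel8 of the jmp and later decoded as the ADD opcode once the loop branches back to it.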

You can do this with opcodes that consume more later bytes, like 3D <dword> cmp eax, imm32. When the CPU sees a 3D opcode byte, it will grab the next 4 bytes as the immediate. If you later jump into those 4 bytes, they'll be considered as prefixes/opcodes, and everything will work the same (except for performance problems) regardless of how those bytes had previously been decoded as a different part of an instruction. Performance aside, the CPU has to maintain the illusion of decoding and executing 1 instruction at a time.
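Here is a small Python sketch of that cmp eax, imm32 case: the same 5 bytes are decoded once from offset 0 and once from offset 1. The byte string and the name `linear_decode` are assumptions chosen for illustration, and only three opcodes are supported.

```python
CMP_BYTES = bytes.fromhex("3d909090c3")   # 3d + 4 bytes that are also valid code

def linear_decode(code: bytes, start: int):
    """Decode sequentially from `start`, handling only 3d, 90 and c3."""
    out, pos = [], start
    while pos < len(code):
        op = code[pos]
        if op == 0x3D:                          # cmp eax, imm32
            imm = int.from_bytes(code[pos + 1:pos + 5], "little")
            out.append(f"cmp eax,{imm:#x}")
            pos += 5
        elif op == 0x90:
            out.append("nop")
            pos += 1
        elif op == 0xC3:
            out.append("ret")
            pos += 1
        else:
            raise ValueError(f"opcode {op:#x} is not handled in this sketch")
    return out

print(linear_decode(CMP_BYTES, 0))   # one instruction that swallows the rest
print(linear_decode(CMP_BYTES, 1))   # the "immediate" decoded as instructions
```

Starting at offset 0, the four trailing bytes are just an immediate; starting at offset 1, they are three nops and a ret.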

I learned of this trick from @Ira Baxter's answer on Can assembled ASM code result in more than a single possible way (except for offset values)?
