How to tell length of an x86-64 instruction opcode using CPU itself?


Question

I know that there are libraries that can "parse" binary machine code / opcodes to tell the length of an x86-64 CPU instruction.

But I'm wondering: since the CPU has internal circuitry to determine this, is there a way to use the processor itself to tell the instruction size from binary code? (Maybe even a hack?)

Answer

The Trap Flag (TF) in EFLAGS/RFLAGS makes the CPU single-step, i.e. take an exception after running one instruction.

So if you write a debugger, you can use the CPU's single-stepping capability to find instruction boundaries in a block of code. But that only works by actually running the code, and if an instruction faults (e.g. a load from an unmapped address), you get that exception instead of the TF single-step exception.

(Most OSes have facilities for attaching to and single-stepping another process, e.g. Linux ptrace, so you could maybe create an unprivileged sandbox process where you could step through some unknown bytes of machine code...)

Or as @Rbmn points out, you can use OS-assisted debug facilities to single-step yourself.
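
As a rough illustration of single-stepping yourself, here is a minimal sketch in C. It assumes Linux on x86-64 with glibc and GCC or Clang; the name insn_length_by_single_step and the helper variables are hypothetical, not part of any library. It sets TF with pushfq/popfq, jumps to a copy of the bytes, and lets a SIGTRAP handler read how far RIP moved. It is only meaningful for an instruction that neither branches nor faults, and it really executes the bytes, so untrusted input belongs in a sacrificial process.

```c
#define _GNU_SOURCE            /* for REG_RIP in <ucontext.h> (glibc) */
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <ucontext.h>

static sigjmp_buf step_env;
static volatile uintptr_t step_start;    /* address of the instruction under test */
static volatile sig_atomic_t step_seen;  /* set once RIP has reached step_start */
static volatile long step_len;

static void on_sigtrap(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)info;
    ucontext_t *uc = ctx;
    uintptr_t rip = (uintptr_t)uc->uc_mcontext.gregs[REG_RIP];

    if (!step_seen) {
        /* Still in the launch stub. TF stays set in the saved flags, so
           returning from the handler steps exactly one more instruction. */
        if (rip == step_start)
            step_seen = 1;
        return;
    }
    step_len = (long)(rip - step_start);  /* RIP advanced by one instruction */
    siglongjmp(step_env, 1);              /* TF is already clear inside the handler */
}

/* Hypothetical helper: returns the length in bytes, or -1 on setup failure. */
long insn_length_by_single_step(const uint8_t *bytes, size_t n)
{
    struct sigaction sa, old;

    assert(n >= 1 && n <= 15);            /* x86 instructions are at most 15 bytes */
    uint8_t *code = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (code == MAP_FAILED)
        return -1;
    memset(code, 0x90, 4096);             /* pad with NOPs */
    memcpy(code, bytes, n);
    if (mprotect(code, 4096, PROT_READ | PROT_EXEC) != 0) {
        munmap(code, 4096);
        return -1;
    }

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigtrap;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGTRAP, &sa, &old);

    step_start = (uintptr_t)code;
    step_seen = 0;
    step_len = -1;
    if (sigsetjmp(step_env, 1) == 0) {
        __asm__ volatile(
            "lea -128(%%rsp), %%rsp\n\t"  /* skip the red zone before pushing */
            "pushfq\n\t"
            "orq $0x100, (%%rsp)\n\t"     /* set TF (bit 8) in the saved flags */
            "popfq\n\t"
            "jmp *%0\n\t"                 /* control comes back via siglongjmp */
            : : "r"(code) : "memory", "cc");
        __builtin_unreachable();
    }

    sigaction(SIGTRAP, &old, NULL);
    munmap(code, 4096);
    return step_len;
}
```

Calling it with the single byte 0x90 (NOP), or with the three bytes 48 89 c0 (mov rax, rax), should report the length the hardware decoded for those bytes.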


@Harold and @MargaretBloom also point out that you can put bytes at the end of a page (followed by an unmapped page) and run them. See if you get a #UD, a page fault, or a #GP exception.

  • #UD: the decoders saw a complete but invalid instruction.
  • page fault on the unmapped page: the decoders hit the unmapped page before deciding that it was an illegal instruction.
  • #GP: the instruction was privileged or faulted for other reasons.

To rule out decoding+running as a complete instruction and then faulting on the unmapped page, start with only 1 byte before the unmapped page, and keep adding more bytes until you stop getting page faults.
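
The grow-until-it-stops-page-faulting loop described above could be sketched roughly as follows. This again assumes Linux on x86-64 with glibc and GCC or Clang; insn_length_by_page_fault, needs_more_bytes and on_fault are hypothetical names for this illustration. The handler treats a SIGSEGV whose RIP is still at the first byte and whose fault address is the guard page as "the decoders wanted more bytes"; anything else (#UD, a fault after the instruction ran, and so on) ends the search. It does not cope with branches or instructions that touch the stack, and it executes the bytes for real.

```c
#define _GNU_SOURCE            /* for REG_RIP in <ucontext.h> (glibc) */
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <ucontext.h>
#include <unistd.h>

enum { FETCH_FAULT = 1, OTHER_STOP = 2 };

static sigjmp_buf probe_env;
static volatile uintptr_t probe_start;  /* first byte under test */
static volatile uintptr_t probe_guard;  /* first byte of the inaccessible page */

static void on_fault(int sig, siginfo_t *info, void *ctx)
{
    ucontext_t *uc = ctx;
    uintptr_t rip = (uintptr_t)uc->uc_mcontext.gregs[REG_RIP];
    /* Fetch ran off the page mid-instruction: RIP has not moved and the
       faulting address is the guard page. */
    int incomplete = sig == SIGSEGV && rip == probe_start &&
                     (uintptr_t)info->si_addr == probe_guard;
    siglongjmp(probe_env, incomplete ? FETCH_FAULT : OTHER_STOP);
}

/* Place the first n bytes so that they end exactly at the guard page, run them. */
static int needs_more_bytes(uint8_t *guard, const uint8_t *bytes, size_t n)
{
    uint8_t *start = guard - n;
    memcpy(start, bytes, n);
    probe_start = (uintptr_t)start;
    probe_guard = (uintptr_t)guard;
    switch (sigsetjmp(probe_env, 1)) {
    case 0:
        __asm__ volatile("jmp *%0" : : "r"(start) : "memory");
        __builtin_unreachable();
    case FETCH_FAULT:
        return 1;
    default:
        return 0;
    }
}

/* Hypothetical helper: length in bytes, or -1 if `avail` bytes were not enough. */
long insn_length_by_page_fault(const uint8_t *bytes, size_t avail)
{
    static const int sigs[] = { SIGSEGV, SIGILL, SIGBUS, SIGFPE, SIGTRAP };
    enum { NSIGS = sizeof sigs / sizeof sigs[0] };
    struct sigaction sa, old[NSIGS];
    long page = sysconf(_SC_PAGESIZE), len = -1;

    assert(avail >= 1);
    /* A writable and executable mapping keeps the sketch short. */
    uint8_t *map = mmap(NULL, 2 * (size_t)page, PROT_READ | PROT_WRITE | PROT_EXEC,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (map == MAP_FAILED)
        return -1;
    uint8_t *guard = map + page;
    if (mprotect(guard, (size_t)page, PROT_NONE) != 0) {
        munmap(map, 2 * (size_t)page);
        return -1;
    }

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    for (int i = 0; i < NSIGS; i++)
        sigaction(sigs[i], &sa, &old[i]);

    for (size_t n = 1; n <= avail && n <= 15; n++) {
        if (!needs_more_bytes(guard, bytes, n)) {
            len = (long)n;              /* first length that stops fetch-faulting */
            break;
        }
    }

    for (int i = 0; i < NSIGS; i++)
        sigaction(sigs[i], &old[i], NULL);
    munmap(map, 2 * (size_t)page);
    return len;
}
```

Starting at one byte and growing matters here: a complete instruction placed right before the guard page also ends in a page fault, but with RIP already past the instruction, which is how the handler tells the two cases apart.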

"Breaking the x86 ISA" by Christopher Domas goes into more detail about this technique, including using it to find undocumented illegal instructions, e.g. 9a13065b8000d7 is a 7-byte illegal instruction; that's when it stops page-faulting. (objdump -d just says 0x9a (bad) and decodes the rest of the bytes, but apparently real Intel hardware isn't satisfied that it's bad until it has fetched 6 more bytes.)


HW performance counters like inst_retired.any also expose instruction counts, but without knowing where an instruction ends, you don't know where to put an rdpmc instruction. Padding with 0x90 NOPs and seeing how many instructions were executed in total probably wouldn't work either, because you'd have to know where to cut and start padding.


"I'm wondering, why wouldn't Intel and AMD introduce an instruction for that?"

For debugging, you normally want to fully disassemble an instruction, not just find instruction boundaries. So you need a full software library anyway.

It wouldn't make sense to put a microcoded disassembler behind some new opcode.

Besides, the hardware decoders are wired up only as part of the front-end's code-fetch path; there is no way to feed them arbitrary data. They're already busy decoding instructions most cycles. An instruction that decodes x86 machine-code bytes would almost certainly be implemented by replicating that hardware in an ALU execution unit, not by querying the decoded-uop cache or L1i (in designs where instruction boundaries are marked in L1i), and not by sending data through the actual front-end pre-decoders and capturing the result instead of queuing it for the rest of the front-end.

The only real high-performance use-case I can think of is emulation, or supporting new instructions the way Intel's Software Development Emulator (SDE) does. But if you want to run new instructions on old CPUs, the whole point is that the old CPUs don't know about those new instructions.

The amount of CPU time spent disassembling machine code is pretty tiny compared to the amount of time that CPUs spend doing floating point math, or image processing. There's a reason we have stuff like SIMD FMA and AVX2 vpsadbw in the instruction set to speed up those special-purpose things that CPUs spend a lot of time doing, but not for stuff we can easily do with software.

Remember, the point of an instruction-set is to make it possible to create high-performance code, not to get all meta and specialize in decoding itself.

At the upper end of special-purpose complexity, the SSE4.2 string instructions were introduced in Nehalem. They can do some cool stuff, but are hard to use; see https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 (which also covers strstr, a real use-case where pcmpistri can be faster than SSE2 or AVX2, unlike strlen / strcmp, where plain old pcmpeqb / pminub works very well if used efficiently; see glibc's hand-written asm). Anyway, these instructions are still multi-uop even in Skylake, and aren't widely used. I think compilers have a hard time autovectorizing with them, and most string-processing is done in languages where it's not so easy to tightly integrate a few intrinsics with low overhead.


"installing a trampoline (for hotpatching a binary function)"

Even this requires decoding the instructions, not just finding their length.

If the first few instruction bytes of a function use a RIP-relative addressing mode (or a jcc rel8/rel32, or even a jmp or call), moving them elsewhere will break the code. (Thanks to @Rbmn for pointing out this corner case.)
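
To make that corner case concrete, here is a small sketch in C of the two checks a trampoline installer would need once it has located a ModRM byte or a rel32 displacement. The function names modrm_is_rip_relative and relocate_disp32 are hypothetical, and finding the ModRM byte in the first place still requires a real decoder. In 64-bit mode, a ModRM byte with mod = 00 and r/m = 101 selects [RIP + disp32], and a displacement is relative to the end of its instruction, so moving the instruction by some distance means adjusting the displacement by the opposite amount, which may no longer fit in 32 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* In 64-bit mode, mod == 00 (bits 7-6) with r/m == 101 (bits 2-0)
   means RIP-relative addressing: [RIP + disp32]. */
bool modrm_is_rip_relative(uint8_t modrm)
{
    return (modrm & 0xC7) == 0x05;
}

/* Recompute a 32-bit RIP-relative displacement for an instruction copied,
   unchanged in length, from old_addr to new_addr. Returns false if the
   target is no longer reachable with a signed 32-bit displacement. */
bool relocate_disp32(int32_t disp, uintptr_t old_addr, uintptr_t new_addr,
                     int32_t *out)
{
    assert(out != NULL);
    /* target = old_end + disp = new_end + new_disp, and the instruction
       length is the same at both addresses. */
    int64_t adjusted = (int64_t)disp + (int64_t)(old_addr - new_addr);
    if (adjusted < INT32_MIN || adjusted > INT32_MAX)
        return false;
    *out = (int32_t)adjusted;
    return true;
}
```

For example, the ModRM byte 0x05 in 48 8b 05 (mov rax, [rip+disp32]) is RIP-relative, while 0x45 in 48 8b 45 08 (mov rax, [rbp+8]) is not. A jcc rel8 is worse: after relocation the adjusted offset rarely fits in 8 bits, so the instruction has to be re-encoded rather than patched.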
