How is code stored in the EXE format?


Question


        My questions are as follows:

        1. How does the Portable Executable format (on Windows/Unix) relate to the x86/x64 instruction set in general?
        2. Does the PE format store the exact set of opcodes supported by the processor, or is it a more generic format that the OS converts to match the CPU?
        3. How does the EXE file indicate the instruction set extensions needed (like 3DNOW! or SSE/MMX?)
        4. Are the opcodes common across all platforms like Windows, Mac and Unix?
        5. Intel i386 compatible CPU chips like ones from Intel and AMD use a common instruction set. But I'm sure ARM-powered CPUs use different opcodes. Are these very very different or are the concepts similar? registers, int/float/double, SIMD, etc?

        On newer platforms like .NET, Java or Flash, the instruction sets are stack-based opcodes that a JIT converts to the native format at runtime. Being accustomed to such a format I'd like to know how the "old" native EXE format is executed and formatted. For example, "registers" are usually unavailable in newer platform opcodes, since the JIT converts stack commands to the 16/32 available CPU registers as it deems necessary. But in native formats you need to refer to registers by index, and work out which registers can be reused and how often.

        Answer

        Are ARM opcodes very different from x86 opcodes?

        Yes, they are. You should assume that all instruction sets for different processor families are completely different and incompatible. An instruction set first defines an encoding, which specifies one or more of these:

        • the instruction opcode;
        • the addressing mode;
        • the operand size;
        • the address size;
        • the operands themselves.

        The encoding further depends on how many registers it can address, whether it has to be backwards compatible, if it has to be decodable quickly, and how complex the instruction can be.

        On the complexity: the ARM instruction set requires all operands to be loaded from memory into registers and stored from registers back to memory using specialized load/store instructions, whereas most x86 instructions can encode a memory address as one of their operands and therefore do not need separate load/store instructions.

        Then the instruction set itself: different processors will have specialized instructions to deal with specific situations. Even if two processor families have an instruction for the same thing (e.g. an add instruction), the instructions are encoded very differently and may have slightly different semantics.

        As you see, since any CPU designer can decide on all these factors, this makes the instruction set architectures for different processor families completely different and incompatible.
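To make the difference concrete, here is a minimal sketch that builds the bytes of a register-to-register add for both families. The two instructions chosen (`add eax, ebx` and `add r0, r0, r1`) are illustrative picks, not something from the question, and the variable names are made up for this example.

```python
import struct

# x86 (32-bit operands): "add eax, ebx" is opcode 0x01 (ADD r/m32, r32)
# followed by a ModRM byte: mod=11 (register), reg=011 (EBX), rm=000 (EAX).
x86_add = bytes([0x01, (0b11 << 6) | (0b011 << 3) | 0b000])

# ARM (classic 32-bit encoding): "add r0, r0, r1" is a single fixed-size
# word, stored little-endian in memory.
arm_add = struct.pack("<I", 0xE0800001)

print("x86 add eax, ebx  :", x86_add.hex())
print("ARM add r0, r0, r1:", arm_add.hex())
```

The same operation takes two bytes on x86 and four on ARM, and no byte of one encoding means anything to the other processor.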

        Are registers, int/float/double and SIMD very different concepts on different architectures?

        No, they are very similar. Every modern architecture has registers and can handle integers, and most can handle IEEE 754 compatible floating-point values of some size. For example, the x87 floating-point unit of the x86 architecture works internally with 80-bit extended-precision values, which are rounded to fit the 32-bit or 64-bit floating-point values you know when they are stored to memory. The idea behind SIMD instructions is also the same on all architectures that support it, but many do not support it, and those that do mostly have different requirements or restrictions.
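As a small illustration of why floating-point values carry over between architectures, the sketch below prints the IEEE 754 bit patterns of 1.0 in single and double precision using Python's `struct` module. Any processor that implements IEEE 754 uses these same bit patterns; only the byte order in memory may differ.

```python
import struct

# IEEE 754 single precision: 1 sign bit, 8 exponent bits, 23 fraction bits.
single = struct.pack(">f", 1.0)
# IEEE 754 double precision: 1 sign bit, 11 exponent bits, 52 fraction bits.
double = struct.pack(">d", 1.0)

print(single.hex())  # 3f800000
print(double.hex())  # 3ff0000000000000

# 0.1 has no exact binary representation, so narrowing it to 32 bits
# rounds it to a different value than the 64-bit version.
as_single = struct.unpack(">f", struct.pack(">f", 0.1))[0]
print(as_single == 0.1)  # False
```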

        Are the opcodes common across all platforms like Windows, Mac and Unix?

        Given three Intel x86 systems, one running Windows, one running Mac OS X and one running Unix/Linux, then yes, the opcodes are exactly the same since they run on the same processor. However, each operating system is different. Many aspects such as memory allocation, graphics, device driver interfacing and threading require operating system specific code, and each system expects its own executable file format (PE on Windows, Mach-O on Mac OS X, ELF on Linux). So you generally can't run an executable compiled for Windows on Linux.
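One visible difference between the operating systems is the container the machine code is packed in: Windows uses PE, Linux and most Unix systems use ELF, and Mac OS X uses Mach-O. The sketch below tells them apart by the first bytes of a file. The function name `detect_format` is made up for this example, and the check is deliberately rough: it only covers little-endian Mach-O and ignores fat binaries.

```python
def detect_format(header: bytes) -> str:
    """Guess the executable container from the first bytes of a file."""
    if header[:4] == b"\x7fELF":
        return "ELF"
    if header[:2] == b"MZ":
        # PE files start with a DOS "MZ" stub; old DOS programs do too.
        return "PE/MZ"
    if header[:4] in (b"\xcf\xfa\xed\xfe", b"\xce\xfa\xed\xfe"):
        # 64-bit and 32-bit Mach-O magic as stored on little-endian systems.
        return "Mach-O"
    return "unknown"


print(detect_format(b"MZ\x90\x00"))          # PE/MZ
print(detect_format(b"\x7fELF\x02\x01"))     # ELF
print(detect_format(b"\xcf\xfa\xed\xfe"))    # Mach-O
```

To try it on a real file, pass in something like `open(path, "rb").read(4)`.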

        Does the PE format store the exact set of opcodes supported by the processor, or is it a more generic format that the OS converts to match the CPU?

        No, the PE format does not store the set of opcodes. As explained earlier, the instruction set architectures of different processor families are simply too different to make this possible. A PE file usually stores machine code for one specific processor family and operating system family, and will only run on such processors and operating systems.

        There is, however, one exception: .NET assemblies are also PE files, but they contain generic instructions that are not specific to any processor or operating system. Such PE files can be 'run' on other systems, but not directly. For example, Mono on Linux can run such .NET assemblies.
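The processor family a PE file targets is recorded in the `Machine` field of its header. The sketch below reads that field from raw bytes. The helper name `pe_machine` and the small lookup table are made up for this example, the table lists only a few of the defined machine values, and the input is a synthetic buffer rather than a real executable.

```python
import struct

# A few of the machine types defined for the PE/COFF header (not complete).
MACHINE_NAMES = {
    0x014C: "x86 (i386)",
    0x8664: "x86-64",
    0x01C0: "ARM",
    0xAA64: "ARM64",
}


def pe_machine(data: bytes) -> str:
    """Return the target processor family named in a PE header."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    # The 32-bit value at offset 0x3C points to the "PE\0\0" signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    # The 16-bit Machine field directly follows the signature.
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    return MACHINE_NAMES.get(machine, f"unknown (0x{machine:04X})")


# Synthetic header for demonstration only: just enough bytes to parse.
fake = bytearray(0x80)
fake[0:2] = b"MZ"
struct.pack_into("<I", fake, 0x3C, 0x40)
fake[0x40:0x44] = b"PE\0\0"
struct.pack_into("<H", fake, 0x44, 0x8664)

print(pe_machine(bytes(fake)))  # x86-64
```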

        How does the EXE file indicate the instruction set extensions needed (like 3DNOW! or SSE/MMX?)

        While the executable can indicate the instruction set for which it was built (see Chris Dodd's answer), I don't believe the executable can indicate the extensions that are required. However, the executable code, when run, can detect such extensions. For example, the x86 instruction set has a CPUID instruction that reports the extensions and features supported by that particular CPU. The executable would just test for the ones it needs and abort, or fall back to a slower code path, when the processor does not meet the requirements.
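Python cannot execute CPUID itself, so the sketch below only shows the second half of such a check: decoding the feature bits that CPUID leaf 1 returns in the EDX and ECX registers. The register values are passed in as plain integers, the function names are made up for this example, and the tables list only a handful of the defined bits.

```python
# Selected feature bits from CPUID leaf 1 (not a complete list).
EDX_FEATURES = {23: "MMX", 25: "SSE", 26: "SSE2"}
ECX_FEATURES = {0: "SSE3", 19: "SSE4.1", 20: "SSE4.2", 28: "AVX"}


def decode_features(edx: int, ecx: int) -> set:
    """Translate the raw register values into a set of feature names."""
    found = {name for bit, name in EDX_FEATURES.items() if (edx >> bit) & 1}
    found |= {name for bit, name in ECX_FEATURES.items() if (ecx >> bit) & 1}
    return found


def missing_features(edx: int, ecx: int, needed) -> list:
    """Return the required features that this CPU does not report."""
    return sorted(set(needed) - decode_features(edx, ecx))


# Hypothetical register values: a CPU that reports MMX and SSE only.
edx = (1 << 23) | (1 << 25)
ecx = 0
print(missing_features(edx, ecx, ["SSE", "SSE2"]))  # ['SSE2']
```

A native program would run the CPUID instruction at startup, perform this kind of test, and exit with an error message if the list of missing features is not empty.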

        .NET versus native code

        You seem to know a thing or two about .NET assemblies and their instruction set, called CIL (Common Intermediate Language). Each CIL instruction follows a specific encoding and uses the evaluation stack for its operands. The CIL instruction set is kept very general and high-level. When an assembly is run (on Windows via mscoree.dll, on Linux by Mono) and a method is called, the Just-In-Time (JIT) compiler takes the method's CIL instructions and compiles them to machine code. Depending on the operating system and processor family, the compiler has to decide which machine instructions to use and how to encode them. The compiled result is stored somewhere in memory. The next time the method is called, the code jumps directly to the compiled machine code and can execute just as efficiently as a native executable.
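To show what "uses the evaluation stack for its operands" means, here is a toy stack machine. It is not CIL and not a JIT: the three opcodes are invented for this example, and the program is interpreted rather than compiled. The point is only that no instruction names a register; operands are whatever sits on top of the stack.

```python
def run(program):
    """Interpret a list of (opcode, *args) tuples on an evaluation stack."""
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(args[0])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack.pop()


# Computes (2 + 3) * 4 without ever mentioning a register.
program = [("push", 2), ("push", 3), ("add",), ("push", 4), ("mul",)]
print(run(program))  # 20
```

A JIT compiler reads the same kind of instruction stream, but instead of executing each step it decides which machine registers should hold the stack slots and emits native instructions for the target processor.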

        How are ARM instructions encoded?

        I have never worked with ARM, but from a quick glance at the documentation I can tell you the following. An ARM instruction is always 32 bits in length. There are many exceptional encodings (e.g. for branching and coprocessor instructions), but the general format of an ARM instruction is like this:

        31             28  27  26  25              21  20              16
        +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+--
        |   Condition   | 0 | 0 |R/I|    Opcode     | S |   Operand 1   | ...
        +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+--
        
                           12                                               0
          --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
        ... |  Destination  |               Operand 2                       |
          --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
        

        The fields mean the following:

        • Condition: A condition that, when true, causes the instruction to be executed. This looks at the Zero, Carry, Negative and Overflow flags. When set to 1110, the instruction is always executed.
        • R/I: When 0, operand 2 is a register. When 1, operand 2 is a constant value.
        • Opcode: The instruction's opcode.
        • S: When 1, the Zero, Carry, Negative and Overflow flags are set according to the instruction's result.
        • Operand 1: The index of a register that is used as the first operand.
        • Destination: The index of a register that is used as the destination operand.
        • Operand 2: The second operand. When R/I is 0, the index of a register. When R/I is 1, an unsigned 8-bit constant value. In addition to either one of these, some bits in operand 2 indicate whether the value is shifted/rotated.

        For more detailed information you should read the documentation for the specific ARM version you want to know about. I used this ARM7TDMI-S Data Sheet, Chapter 4 for this example.
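The field layout above can be sketched as a small decoder. The function name and the dictionary keys are made up for this example, and the sample word `0xE2811005` (which should correspond to `add r1, r1, #5`) was assembled by hand from the field layout rather than taken from the data sheet.

```python
def decode_data_processing(word: int) -> dict:
    """Split a 32-bit ARM data-processing instruction into its fields."""
    return {
        "condition": (word >> 28) & 0xF,    # bits 31-28
        "immediate": (word >> 25) & 0x1,    # bit 25, the R/I flag
        "opcode": (word >> 21) & 0xF,       # bits 24-21
        "set_flags": (word >> 20) & 0x1,    # bit 20, the S flag
        "operand1": (word >> 16) & 0xF,     # bits 19-16, register index
        "destination": (word >> 12) & 0xF,  # bits 15-12, register index
        "operand2": word & 0xFFF,           # bits 11-0
    }


fields = decode_data_processing(0xE2811005)
for name, value in fields.items():
    print(f"{name:12} = {value:#x}")
```

Here the condition field is `0xe` (binary 1110, always execute), the R/I flag is 1, and operand 2 holds the constant 5.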

        Note that each ARM instruction, no matter how simple, takes 4 bytes to encode. Because of this possible overhead, modern ARM processors allow you to use an alternative 16-bit instruction set called Thumb. It cannot express everything the 32-bit instruction set can, but its instructions are also half as big.

        On the other hand, x86-64 instructions have a variable length encoding, and use all kinds of modifiers to adjust the behavior of individual instructions. If you want to compare the ARM instructions with how x86 and x86-64 instructions are encoded, you should read the x86-64 Instruction Encoding article that I wrote on OSDev.org.
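A few hand-picked x86-64 encodings show what variable length means in practice. The selection below is illustrative only, and the dictionary name is made up for this example.

```python
# Instruction text mapped to its machine-code bytes.
encodings = {
    "nop": bytes([0x90]),
    "xor eax, eax": bytes([0x31, 0xC0]),
    "mov eax, 1": bytes([0xB8, 0x01, 0x00, 0x00, 0x00]),
    # REX.W prefix (0x48) selects the 64-bit form with an 8-byte immediate.
    "mov rax, 1 (imm64 form)": bytes([0x48, 0xB8, 1, 0, 0, 0, 0, 0, 0, 0]),
}

for text, code in encodings.items():
    print(f"{len(code):2} bytes  {code.hex():20}  {text}")
```

The lengths here range from 1 to 10 bytes, whereas every classic ARM instruction occupies exactly 4.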


        Your original question is very broad. If you want to know more, you should do some research and create a new question with the specific thing you want to know.
