What is the actual relation between assembly, machine code, bytecode, and opcode?


Question


I have read most of the SO questions about assembly and machine code, such as this, but they are too high level and do not show examples of actual assembly code being transformed into machine code. As a result, I still don't understand how it works at a deeper level.

The ideal answer to this question would show a specific example of some assembly code, such as the snippet below, and how each assembly instruction gets mapped to machine code, bytecode, and/or opcode. An answer like this would be very helpful to future people learning assembly, because so far in the past few days of digging I haven't found any clear summary.

The main things I am looking for are:

  1. a snippet of assembly code
  2. a snippet of machine code
  3. a mapping between the snippet of assembly and machine code (how to do that mapping, or at least some general examples, and how do you know how to do this, where is all this information on the web)
  4. how to interpret the machine code (like are opcodes somehow related, and where is all the information on the web about what all those numbers mean)

Note: I don't have a computer science background, so I have just been slowly going lower level over the past several years and have now gotten to the point of wanting to understand assembly and machine code.

Relation Between Assembly and Machine Code

My current understanding is that an "assembler" (such as NASM) takes assembly code and creates machine code from it.

So when you compile some assembly such as this example.asm:

global main
section .text

main:
  call write

write:
  mov rax, 0x2000004
  mov rdi, 1
  mov rsi, message
  mov rdx, length
  syscall

section .data
message: db 'Hello, world!', 0xa
length: equ $ - message

(compile it with nasm -f macho64 -o example.o example.asm). It outputs this example.o object file:

cffa edfe 0700 0001 0300 0000 0100 0000
0200 0000 0001 0000 0000 0000 0000 0000
1900 0000 e800 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
2e00 0000 0000 0000 2001 0000 0000 0000
2e00 0000 0000 0000 0700 0000 0700 0000
0200 0000 0000 0000 5f5f 7465 7874 0000
0000 0000 0000 0000 5f5f 5445 5854 0000
0000 0000 0000 0000 0000 0000 0000 0000
2000 0000 0000 0000 2001 0000 0000 0000
5001 0000 0100 0000 0005 0080 0000 0000
0000 0000 0000 0000 5f5f 6461 7461 0000
0000 0000 0000 0000 5f5f 4441 5441 0000
0000 0000 0000 0000 2000 0000 0000 0000
0e00 0000 0000 0000 4001 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0200 0000 1800 0000
5801 0000 0400 0000 9801 0000 1c00 0000
e800 0000 00b8 0400 0002 bf01 0000 0048
be00 0000 0000 0000 00ba 0e00 0000 0f05
4865 6c6c 6f2c 2077 6f72 6c64 210a 0000
1100 0000 0100 000e 0700 0000 0e01 0000
0500 0000 0000 0000 0d00 0000 0e02 0000
2000 0000 0000 0000 1500 0000 0200 0000
0e00 0000 0000 0000 0100 0000 0f01 0000
0000 0000 0000 0000 0073 7461 7274 0077
7269 7465 006d 6573 7361 6765 006c 656e
6774 6800 

(that is the entire contents of example.o). When you then "link" that using ld -o example example.o, it gives you more machine code:

cffa edfe 0700 0001 0300 0080 0200 0000
0d00 0000 7803 0000 8500 0000 0000 0000
1900 0000 4800 0000 5f5f 5041 4745 5a45
524f 0000 0000 0000 0000 0000 0000 0000
0010 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 1900 0000 9800 0000
5f5f 5445 5854 0000 0000 0000 0000 0000
0010 0000 0000 0000 0010 0000 0000 0000
... 523 lines of this

But how did it go from assembly instructions, to those numbers? Is there some sort of standard reference that lists out all of those numbers, and what they mean, for whatever architecture you are on (I am using x86-64 through NASM on OSX), and how each set of numbers maps to each assembly instruction?

I understand that machine code is different for every machine, and there are dozens if not hundreds of different types of machines. So I am not currently looking for how assembly gets transformed to every one (that would be complicated). I just am interested in an example that illustrates how the transformation works, and any architecture can serve as the example. And from that point, I could go and research the specific architecture I am interested in and find the mapping.

Relation Between Assembly and Bytecode (or is it called "opcode"?)

So from my reading so far, assembly gets transformed into machine code as demonstrated above.

But now I get confused. I see people talk about bytecode, such as in this SO answer, showing stuff like this:

void myfunc(int a) {
  printf("%s", a);
}

The assembly for this function would look like this:

OP Params OpName     Description
13 82 6a  PushString 82 means string, 6a is the address of "%s"
                     So this function pushes a pointer to "%s" on the stack.
13 83 00  PushInt    83 means integer, 00 means the one on the top of the stack.
                     So this function gets the integer at the top of the stack,
                     And pushes it on the stack again
17 13 88  Call       1388 is printf, so this calls the printf function
03 02     Pop        This pops the two things we pushed back off the stack
02        Return     This returns to the calling code.

So then I get confused. Doing some digging, I can't tell if each of those 2-digit hex numbers like 13 82 6a are each, individually, called "opcodes", and the whole set of them is called "bytecode" as a catch-all term. In addition, I can't find a table that lists out all of these 2-digit hex numbers, and what their relation is to machine code, or assembly.

To summarize, I am very much looking forward to an example showing how assembly instructions map to machine code, and its relation to bytecode and/or opcode. (I am not looking for how a compiler does this, just how the general mapping works.) I think this would clarify it not only for myself but for many people down the road who are interested in learning more about the bare metal.

One other reason this would be valuable to know is so that one can understand how the LLVM compiler generates machine code. Does it have some sort of "complete list" of 2-digit opcodes or 4-digit machine code sequences, and know exactly how those map to any architecture-specific assembly? Where did that information come from? An answer to this overall question would make it much clearer how LLVM implements its code generation.

Update

Updating from @HansPassant's comment. I actually don't care what the actual distinctions are between the words, sorry if that wasn't clear. I just want to know this: how does assembly map to machine code (and where are places to begin looking for the references that hold that information on the web), and are opcodes or bytecode used anywhere in that process? And if so how?

Solution

Yes, each architecture has an instruction set reference that describes how instructions are encoded. For x86, it's the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z.

Most assemblers, including nasm, can produce a listing file for you. Feeding your sample code to nasm -l, we get:

 1                                  global main
 2                                  section .text
 3
 4                                  main:
 5 00000000 E800000000                call write
 6
 7                                  write:
 8 00000005 B804000002                mov rax, 0x2000004
 9 0000000A BF01000000                mov rdi, 1
10 0000000F 48BE-                     mov rsi, message
11 00000011 [0000000000000000]
12 00000019 BA0E000000                mov rdx, length
13 0000001E 0F05                      syscall
14
15                                  section .data
16 00000000 48656C6C6F2C20776F-     message: db 'Hello, world!', 0xa
17 00000009 726C64210A
18                                  length: equ $ - message

You can see the generated machine code in the third column (first is line number, second is address).
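To make the address column concrete, here is a small Python sketch (not part of NASM; the names LISTING, check_addresses and call_target are made up for illustration). It copies the .text rows of the listing above and checks that each address equals the previous address plus the number of bytes emitted. It also decodes the first instruction, assuming the usual x86 rule that E8 is followed by a 32-bit little-endian displacement measured from the next instruction. The eight zero bytes after 48BE are the bracketed placeholder from the listing, which the linker is expected to fill in later.

```python
# (address, machine code as hex, source line), copied from the .text rows of the listing.
LISTING = [
    (0x00, "E800000000", "call write"),
    (0x05, "B804000002", "mov rax, 0x2000004"),
    (0x0A, "BF01000000", "mov rdi, 1"),
    (0x0F, "48BE" + "00" * 8, "mov rsi, message"),  # placeholder address bytes
    (0x19, "BA0E000000", "mov rdx, length"),
    (0x1E, "0F05", "syscall"),
]


def check_addresses(listing):
    """Return True if every address is the previous address plus the previous length."""
    expected = listing[0][0]
    for addr, hexbytes, _source in listing:
        if addr != expected:
            return False
        expected = addr + len(bytes.fromhex(hexbytes))
    return True


def call_target(addr, hexbytes):
    """Decode 'E8 rel32': the target is the next instruction's address plus rel32."""
    code = bytes.fromhex(hexbytes)
    if code[0] != 0xE8 or len(code) != 5:
        raise ValueError("not an E8 rel32 call")
    rel32 = int.from_bytes(code[1:5], "little", signed=True)
    return addr + len(code) + rel32


if __name__ == "__main__":
    print(check_addresses(LISTING))
    print(hex(call_target(0x00, "E800000000")))
```

The call at address 0 has a displacement of 0, so it targets address 5, which is exactly where the write label sits in the listing.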

Note that the output of the assembler is an object file, and the output of the linker is an executable. Both of those have a complex structure and contain more than just the machine code. This is why your hexdump differs from the above listing.
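As a rough check of that point, the following Python sketch takes three rows copied from the example.o hexdump in the question and searches them for the bytes from the listing. The variable names are invented for this illustration, and it assumes the hexdump prints bytes in file order. Everything else in the dump (headers, section names, symbol table) is the "complex structure" around the code.

```python
# Three consecutive rows copied from the example.o hexdump in the question.
DUMP_ROWS = """
e800 0000 00b8 0400 0002 bf01 0000 0048
be00 0000 0000 0000 00ba 0e00 0000 0f05
4865 6c6c 6f2c 2077 6f72 6c64 210a 0000
"""

# Machine code column of the listing, .text rows and .data rows.
LISTING_TEXT = "E800000000 B804000002 BF01000000 48BE 0000000000000000 BA0E000000 0F05"
LISTING_DATA = "48656C6C6F2C20776F 726C64210A"


def unhex(text):
    """Turn whitespace-separated hex digits into bytes."""
    return bytes.fromhex("".join(text.split()))


dump = unhex(DUMP_ROWS)
text_section = unhex(LISTING_TEXT)
data_section = unhex(LISTING_DATA)

# Offsets are relative to the start of the three copied rows, not to the file.
text_at = dump.find(text_section)
data_at = dump.find(data_section)

if __name__ == "__main__":
    print(text_at, data_at, data_section)
```

Both byte strings are found in the dump, with the data bytes decoding to the greeting string, so the listing and the object file agree on the code and data themselves.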

Opcode is generally considered to be the part of the machine code instruction that specifies the operation to perform. For example, in the above code you have B804000002 mov rax, 0x2000004. There B8 is the opcode, and 04000002 is the immediate operand: the bytes 04 00 00 02 are the value 0x02000004 stored in little-endian order (least significant byte first).
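The split into opcode and operand can be sketched in Python for the mov instructions in the listing. This is a deliberately tiny decoder, not a real disassembler: decode_mov_imm and the register tables are names chosen for this example, and it only handles the "B8 + register number" family, optionally preceded by the 48 prefix seen in the listing. Note that B8 followed by a 4-byte immediate is the 32-bit form (mov eax, imm32); NASM can use it for mov rax, 0x2000004 because on x86-64 a write to eax zero-extends into rax.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_mov_imm(code: bytes) -> str:
    """Decode 'mov reg, imm' in the B8+r encoding only; reject everything else."""
    pos = 0
    wide = False
    if code[pos] == 0x48:  # REX.W prefix: 64-bit register, 8-byte immediate
        wide = True
        pos += 1
    opcode = code[pos]
    if opcode & 0xF8 != 0xB8:  # upper five bits select the operation
        raise ValueError("not a B8+r mov")
    reg = opcode & 0x07  # lower three bits select the register
    size = 8 if wide else 4
    imm_bytes = code[pos + 1:pos + 1 + size]
    if len(imm_bytes) != size:
        raise ValueError("truncated immediate")
    imm = int.from_bytes(imm_bytes, "little")  # immediates are little-endian
    name = (REG64 if wide else REG32)[reg]
    return f"mov {name}, {imm:#x}"


if __name__ == "__main__":
    for hexbytes in ("B804000002", "BF01000000", "48BE0000000000000000", "BA0E000000"):
        print(hexbytes, "->", decode_mov_imm(bytes.fromhex(hexbytes)))
```

So BF and BA in the listing are the same operation as B8 with a different register number folded into the opcode byte, which is why the manual writes this encoding as "B8+ rd".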

Bytecode is not typically used in the assembly context; it could be thought of as the machine code for a virtual machine.
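To illustrate "machine code for a virtual machine", here is a toy stack machine in Python. The instruction set (PUSH, ADD, HALT and their byte values) is entirely invented for this sketch and does not correspond to any real VM or to the bytecode quoted in the question. The interpreter loop plays the role that the CPU plays for real machine code: it reads an opcode byte, reads any operand bytes, and performs the operation.

```python
# Hypothetical opcodes for a toy stack machine (not a real instruction set).
HALT, PUSH, ADD = 0x00, 0x01, 0x02


def run(bytecode: bytes) -> int:
    """Interpret the toy bytecode and return the value on top of the stack at HALT."""
    stack = []
    pc = 0  # index of the next opcode byte
    while True:
        op = bytecode[pc]
        if op == PUSH:  # one operand byte: the value to push
            stack.append(bytecode[pc + 1])
            pc += 2
        elif op == ADD:  # no operands: pop two values, push their sum
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
            pc += 1
        elif op == HALT:
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode {op:#04x}")


if __name__ == "__main__":
    program = bytes([PUSH, 2, PUSH, 3, ADD, HALT])
    print(program.hex(" "), "->", run(program))
```

In this picture each leading byte (01, 02, 00) is an opcode, and the whole byte sequence is the bytecode.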


x86 is a very complicated architecture for a walkthrough, but your sample code happens to contain one simple instruction: syscall. So let's see how to turn that into machine code. Open the reference PDF mentioned above and go to the section about syscall in chapter 4. You will immediately see it listed as opcode 0F 05. Since it doesn't take any operands, we are done: those 2 bytes are the machine code. How do we turn it back? Go to Appendix A: Opcode Map. Section A.1 tells us: "For 2-byte opcodes beginning with 0FH (Table A-3), skip any instruction prefixes, the 0FH byte (0FH may be preceded by 66H, F2H, or F3H) and use the upper and lower 4-bit values of the next opcode byte to index table rows and columns." So we skip the 0F, split the 05 into 0 and 5, and look that up in Table A-3 at row 0, column 5. We find it is a syscall instruction.
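The table lookup described above can be sketched in a few lines of Python. TWO_BYTE_MAP and lookup_two_byte are names made up for this example, and the table holds only the single entry discussed here rather than the full Table A-3. The sketch also assumes no prefix bytes come before the 0F.

```python
# One-entry excerpt of the 2-byte opcode map, keyed by (row, column).
TWO_BYTE_MAP = {(0x0, 0x5): "syscall"}


def lookup_two_byte(code: bytes) -> str:
    """Skip the 0F byte, then index the map by the high and low nibble of the next byte."""
    if len(code) < 2 or code[0] != 0x0F:
        raise ValueError("expected a 2-byte opcode starting with 0F")
    row = code[1] >> 4  # upper 4 bits select the row
    col = code[1] & 0x0F  # lower 4 bits select the column
    return TWO_BYTE_MAP.get((row, col), f"not in this excerpt (row {row:X}, column {col:X})")


if __name__ == "__main__":
    print(lookup_two_byte(bytes.fromhex("0F05")))
```

A real disassembler does the same kind of lookup, only with complete tables and with extra steps for prefixes and operand bytes.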
