What is the actual relation between assembly, machine code, bytecode, and opcode?

Question


I have read most of the SO questions about assembly and machine code, such as this one (http://stackoverflow.com/questions/3434202/what-is-the-difference-between-native-code-machine-code-and-assembly-code), but they are too high level and do not show examples of actual assembly code being transformed into machine code. As a result, I still don't understand how it works at a deeper level.

The ideal answer to this question would show a specific example of some assembly code, such as the snippet below, and how each assembly instruction gets mapped to machine code, bytecode, and/or opcode. An answer like this would be very helpful to future people learning assembly, because so far in the past few days of digging I haven't found any clear summary.

The main things I am looking for are:

  1. a snippet of assembly code
  2. a snippet of machine code
  3. a mapping between the snippet of assembly and machine code (how to do that mapping, or at least some general examples, and how do you know how to do this, where is all this information on the web)
  4. how to interpret the machine code (like are opcodes somehow related, and where is all the information on the web about what all those numbers mean)

Note: I don't have a computer science background, so I have just been slowly going lower level over the past several years and have now gotten to the point of wanting to understand assembly and machine code.

Relation Between Assembly and Machine Code

My current understanding is that an "assembler" (such as NASM) takes assembly code and creates machine code from it.

So when you assemble a file such as this example.asm:

global main
section .text

main:
  call write

write:
  mov rax, 0x2000004
  mov rdi, 1
  mov rsi, message
  mov rdx, length
  syscall

section .data
message: db 'Hello, world!', 0xa
length: equ $ - message

(assembled with nasm -f macho64 -o example.o example.asm), it outputs this example.o object file:

cffa edfe 0700 0001 0300 0000 0100 0000
0200 0000 0001 0000 0000 0000 0000 0000
1900 0000 e800 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
2e00 0000 0000 0000 2001 0000 0000 0000
2e00 0000 0000 0000 0700 0000 0700 0000
0200 0000 0000 0000 5f5f 7465 7874 0000
0000 0000 0000 0000 5f5f 5445 5854 0000
0000 0000 0000 0000 0000 0000 0000 0000
2000 0000 0000 0000 2001 0000 0000 0000
5001 0000 0100 0000 0005 0080 0000 0000
0000 0000 0000 0000 5f5f 6461 7461 0000
0000 0000 0000 0000 5f5f 4441 5441 0000
0000 0000 0000 0000 2000 0000 0000 0000
0e00 0000 0000 0000 4001 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0200 0000 1800 0000
5801 0000 0400 0000 9801 0000 1c00 0000
e800 0000 00b8 0400 0002 bf01 0000 0048
be00 0000 0000 0000 00ba 0e00 0000 0f05
4865 6c6c 6f2c 2077 6f72 6c64 210a 0000
1100 0000 0100 000e 0700 0000 0e01 0000
0500 0000 0000 0000 0d00 0000 0e02 0000
2000 0000 0000 0000 1500 0000 0200 0000
0e00 0000 0000 0000 0100 0000 0f01 0000
0000 0000 0000 0000 0073 7461 7274 0077
7269 7465 006d 6573 7361 6765 006c 656e
6774 6800 

(that is the entire contents of example.o). When you then "link" that using ld -o example example.o, it gives you more machine code:

cffa edfe 0700 0001 0300 0080 0200 0000
0d00 0000 7803 0000 8500 0000 0000 0000
1900 0000 4800 0000 5f5f 5041 4745 5a45
524f 0000 0000 0000 0000 0000 0000 0000
0010 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 1900 0000 9800 0000
5f5f 5445 5854 0000 0000 0000 0000 0000
0010 0000 0000 0000 0010 0000 0000 0000
... 523 lines of this

But how does it get from assembly instructions to those numbers? Is there some sort of standard reference that lists all of those numbers and what they mean for whatever architecture you are on (I am using x86-64 through NASM on OSX), and how each set of numbers maps to each assembly instruction?

I understand that machine code is different for every machine, and there are dozens if not hundreds of different types of machines. So I am not currently looking for how assembly gets transformed to every one (that would be complicated). I just am interested in an example that illustrates how the transformation works, and any architecture can serve as the example. And from that point, I could go and research the specific architecture I am interested in and find the mapping.

Relation Between Assembly and Bytecode (or is it called "opcode"?)

So from my reading so far, assembly gets transformed into machine code as demonstrated above.

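But now I get confused. I see people talk about bytecode, such as in this SO answer (http://stackoverflow.com/questions/27627234/how-does-jit-compilation-actually-execute-the-machine-code-at-runtime/27627308#27627308), showing stuff like this: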

void myfunc(int a) {
  printf("%s", a);
}

The assembly for this function would look like this:

OP Params OpName     Description
13 82 6a  PushString 82 means string, 6a is the address of "%s"
                     So this function pushes a pointer to "%s" on the stack.
13 83 00  PushInt    83 means integer, 00 means the one on the top of the stack.
                     So this function gets the integer at the top of the stack,
                     And pushes it on the stack again
17 13 88  Call       1388 is printf, so this calls the printf function
03 02     Pop        This pops the two things we pushed back off the stack
02        Return     This returns to the calling code.

So then I get confused. After some digging, I can't tell whether each of those 2-digit hex numbers, like 13 82 6a, is individually called an "opcode", with the whole set of them called "bytecode" as a catch-all term. In addition, I can't find a table that lists all of these 2-digit hex numbers and how they relate to machine code or assembly.

To summarize, I am looking for an example showing how assembly instructions map to machine code, and its relation to bytecode and/or opcode. (I am not looking for how a compiler does this, just how the general mapping works.) I think this would clarify it not only for me but for many people down the road who are interested in learning more about the bare metal.

One other reason this would be valuable to know is that it would help in understanding how the LLVM compiler generates machine code. Do they have some sort of "complete list" of 2-digit opcodes or 4-digit machine code sequences, and know exactly how that maps to any architecture-specific assembly? Where did they get that information from? An answer to this overall question would make it much clearer how LLVM implemented its code generation.

Update

Updating from @HansPassant's comment. I actually don't care what the actual distinctions are between the words, sorry if that wasn't clear. I just want to know this: how does assembly map to machine code (and where are places to begin looking for the references that hold that information on the web), and are opcodes or bytecode used anywhere in that process? And if so how?

Answer

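Yes, each architecture has an instruction set reference that specifies how instructions are encoded. For x86, it's the Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z (http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf).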

Most assemblers, including nasm, can produce a listing file for you. Feeding your sample code to nasm -l, we get:

 1                                  global main
 2                                  section .text
 3
 4                                  main:
 5 00000000 E800000000                call write
 6
 7                                  write:
 8 00000005 B804000002                mov rax, 0x2000004
 9 0000000A BF01000000                mov rdi, 1
10 0000000F 48BE-                     mov rsi, message
11 00000011 [0000000000000000]
12 00000019 BA0E000000                mov rdx, length
13 0000001E 0F05                      syscall
14
15                                  section .data
16 00000000 48656C6C6F2C20776F-     message: db 'Hello, world!', 0xa
17 00000009 726C64210A
18                                  length: equ $ - message

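You can see the generated machine code in the third column (the first column is the line number, the second is the address). A trailing hyphen, as in 48BE-, means the bytes of that instruction continue on the next line, and the bracketed zeros are a placeholder for the address of message, which is filled in later.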
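As one way to read that third column, here is a minimal Python sketch that rebuilds the bytes of the call write line by hand. It assumes the near call form (opcode E8 followed by a 32-bit displacement measured from the end of the call instruction); the function name encode_call_rel32 is made up for illustration.

```python
import struct

def encode_call_rel32(call_addr: int, target_addr: int) -> bytes:
    """Encode a near `call` (opcode E8) with a 32-bit relative displacement."""
    next_instruction = call_addr + 5          # E8 plus 4 displacement bytes
    displacement = target_addr - next_instruction
    # The displacement is stored as a little-endian signed 32-bit integer.
    return b"\xE8" + struct.pack("<i", displacement)

# In the listing, `call write` sits at address 0 and `write:` at address 5,
# so the displacement is 0.
print(encode_call_rel32(0x00, 0x05).hex().upper())  # prints E800000000
```

The displacement is zero only because write happens to start right after the call; a backwards call would produce a negative displacement stored in two's complement.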

Note that the output of the assembler is an object file, and the output of the linker is an executable. Both of those have a complex structure and contain more than just the machine code. This is why your hexdump differs from the above listing.
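To make that concrete, the sketch below (Python, with helper names invented for this example) decodes the first 16 bytes of both hexdumps from the question and checks that two rows of the example.o dump are exactly the bytes from column three of the listing. The field layout (magic, cputype, cpusubtype, filetype as little-endian 32-bit values, with filetype 1 meaning object file and 2 meaning executable) is my reading of the Mach-O header format, not something stated in the question.

```python
import struct

def hexdump_to_bytes(dump: str) -> bytes:
    """Turn whitespace-separated hex groups (as printed in the question) into bytes."""
    return bytes.fromhex("".join(dump.split()))

def describe_header(header: bytes) -> dict:
    """Decode the first four little-endian 32-bit fields of a 64-bit Mach-O header."""
    magic, cputype, _cpusubtype, filetype = struct.unpack("<4I", header[:16])
    return {"magic": hex(magic), "cputype": hex(cputype), "filetype": filetype}

# First row of each dump, copied from the question.
object_header = hexdump_to_bytes("cffa edfe 0700 0001 0300 0000 0100 0000")
executable_header = hexdump_to_bytes("cffa edfe 0700 0001 0300 0080 0200 0000")

# The two rows of the example.o dump that hold the .text section.
text_rows = hexdump_to_bytes("""
e800 0000 00b8 0400 0002 bf01 0000 0048
be00 0000 0000 0000 00ba 0e00 0000 0f05
""")

# Column three of the listing, in order (48BE- continues on the next line).
listing_column = ["E800000000", "B804000002", "BF01000000",
                  "48BE", "0000000000000000", "BA0E000000", "0F05"]
listing_bytes = bytes.fromhex("".join(listing_column))

print(describe_header(object_header))      # filetype 1: object file
print(describe_header(executable_header))  # filetype 2: executable
print(text_rows == listing_bytes)          # prints True
```

Everything else in the dumps (segment names such as __TEXT and __DATA, the symbol names at the end) is bookkeeping for the linker and loader rather than instructions.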

Opcode is generally considered to be the part of the machine code instruction that specifies the operation to perform. For example, in the above code you have B804000002 mov rax, 0x2000004. Here B8 is the opcode and 04000002 is the immediate operand, the value 0x2000004 stored least-significant byte first (little-endian).
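The same split into opcode and immediate can be sketched in Python for the four mov lines of the listing. This is a deliberately narrow decoder, written under the assumption that only two forms occur: opcodes B8 through BF (move a 32-bit immediate into a 32-bit register, with the register number in the low three bits of the opcode) and the same opcodes preceded by the prefix byte 48 (64-bit register, 64-bit immediate). The function and table names are made up for illustration.

```python
import struct

# Register numbers 0..7 as encoded in the low three bits of the opcode.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]

def decode_mov_imm(code: bytes) -> str:
    """Decode `mov reg, imm` for opcodes B8..BF, optionally prefixed by 48."""
    wide = code[0] == 0x48                 # 48 selects the 64-bit form
    body = code[1:] if wide else code
    opcode = body[0]
    if not 0xB8 <= opcode <= 0xBF:
        raise ValueError(f"not a mov-immediate opcode: {opcode:#04x}")
    reg = opcode & 0x07                    # low three bits pick the register
    if wide:
        (imm,) = struct.unpack("<Q", body[1:9])   # 8-byte little-endian immediate
        return f"mov {REG64[reg]}, {imm:#x}"
    (imm,) = struct.unpack("<I", body[1:5])       # 4-byte little-endian immediate
    return f"mov {REG32[reg]}, {imm:#x}"

for hex_bytes in ["B804000002", "BF01000000", "48BE0000000000000000", "BA0E000000"]:
    print(hex_bytes, "->", decode_mov_imm(bytes.fromhex(hex_bytes)))
```

Note that the first line decodes as mov eax rather than mov rax: judging by the bytes in the listing, the assembler chose the shorter 32-bit encoding, which has the same effect because writing a 32-bit register clears the upper half of the 64-bit one.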

Bytecode is not typically used in the assembly context; it can be thought of as the machine code for a virtual machine.
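To illustrate what "machine code for a virtual machine" can mean, here is a toy stack machine in Python. The three opcodes and their numbers are invented for this example and do not correspond to any real VM or to the table in the question; the point is only that the "CPU" is a loop in software that fetches a byte and dispatches on it.

```python
# Hypothetical opcodes for a toy stack machine; not taken from any real VM.
PUSH, ADD, RET = 0x01, 0x02, 0x03

def run(bytecode: bytes) -> int:
    """Interpret the toy bytecode and return the value left on top of the stack."""
    stack = []
    pc = 0  # index of the next byte to fetch
    while True:
        opcode = bytecode[pc]
        if opcode == PUSH:          # PUSH takes one operand byte
            stack.append(bytecode[pc + 1])
            pc += 2
        elif opcode == ADD:         # ADD pops two values and pushes their sum
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
            pc += 1
        elif opcode == RET:         # RET ends the program
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode {opcode:#04x} at offset {pc}")

program = bytes([PUSH, 2, PUSH, 3, ADD, RET])  # bytes 01 02 01 03 02 03
print(run(program))  # prints 5
```

In this picture each leading byte (01, 02, 03) is an opcode, the bytes that follow it are its operands, and the whole sequence is the bytecode.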


For a walkthrough, x86 is a very complicated architecture. But your sample code happens to have a simple instruction, the syscall. So let's see how to turn that into machine code. Open the above mentioned reference pdf, and go to the section about syscall in chapter 4. You will immediately see it listed as opcode 0F 05. Since it doesn't take any operands, we are done: those 2 bytes are the machine code.

How do we turn it back? Go to Appendix A: Opcode map. Section A.1 tells us: "For 2-byte opcodes beginning with 0FH (Table A-3), skip any instruction prefixes, the 0FH byte (0FH may be preceded by 66H, F2H, or F3H) and use the upper and lower 4-bit values of the next opcode byte to index table rows and columns." So we skip the 0F, split the 05 into 0 and 5, and look that up in table A-3 in row #0, column #5. We find it is a syscall instruction.
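The row and column lookup described above can be sketched in a few lines of Python. The dictionary below is a tiny hand-copied excerpt of the two-byte opcode map, not the full table, and the names TWO_BYTE_OPCODE_EXCERPT and lookup_two_byte_opcode are made up for illustration.

```python
# A tiny excerpt of the two-byte (0F xx) opcode map, keyed by
# (row, column) = (upper nibble, lower nibble) of the second byte.
# The real table has 256 cells; only four are copied here.
TWO_BYTE_OPCODE_EXCERPT = {
    (0x0, 0x5): "syscall",
    (0x0, 0xB): "ud2",
    (0x3, 0x1): "rdtsc",
    (0xA, 0x2): "cpuid",
}

def lookup_two_byte_opcode(code: bytes) -> str:
    """Skip the 0F byte, split the next byte into nibbles, and index the excerpt."""
    if len(code) < 2 or code[0] != 0x0F:
        raise ValueError("expected a two-byte opcode starting with 0F")
    row = code[1] >> 4        # upper 4 bits select the row
    column = code[1] & 0x0F   # lower 4 bits select the column
    return TWO_BYTE_OPCODE_EXCERPT.get(
        (row, column), f"not in excerpt (row {row:X}, column {column:X})"
    )

print(lookup_two_byte_opcode(bytes.fromhex("0F05")))  # prints syscall
```

A real disassembler does the same thing with complete tables and then goes on to decode whatever operand bytes the selected instruction requires.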
