How does C++ linking work in practice?
How does C++ linking work in practice? What I am looking for is a detailed explanation about how the linking happens, and not what commands do the linking.
There's already a similar question about compilation which doesn't go into too much detail: How does the compilation/linking process work?
EDIT: I have moved this answer to the duplicate: http://stackoverflow.com/a/33690144/895245
This answer focuses on address relocation, which is one of the crucial functions of linking.
A minimal example will be used to clarify the concept.
0) Introduction
Summary: relocation edits the .text section of object files to translate object file addresses into the final addresses of the executable.
This must be done by the linker because the compiler only sees one input file at a time, but we must know about all object files at once to decide how to:
- resolve undefined symbols, such as functions that are declared but not defined in the current file
- avoid clashes between the multiple .text and .data sections of multiple object files
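To make the second point concrete, here is a toy sketch in Python of how a linker might place same-named sections from several object files back to back so they do not overlap. The file names, section sizes and base addresses are made up for illustration, and alignment is ignored; a real linker follows its linker script.

```python
# Toy model: each object file contributes sections of a given size (in bytes).
# The file names and sizes below are made up for illustration.
objects = {
    "a.o": {".text": 0x27, ".data": 0x0D},
    "b.o": {".text": 0x10, ".data": 0x08},
}

def layout(objects, bases):
    """Assign each input section a non-overlapping start address.

    Sections with the same name are placed back to back, starting at the
    base address chosen for that output section. Alignment is ignored.
    """
    next_free = dict(bases)
    placed = {}
    for obj_name, sections in objects.items():
        for sec_name, size in sections.items():
            placed[(obj_name, sec_name)] = next_free[sec_name]
            next_free[sec_name] += size
    return placed

# Base addresses are arbitrary example values, not what ld would pick.
placed = layout(objects, {".text": 0x400000, ".data": 0x600000})
for (obj_name, sec_name), addr in sorted(placed.items()):
    print(f"{obj_name:4} {sec_name:6} -> {addr:#x}")
```

Only once this placement is known can any address that refers into a section be fixed up, which is why the work cannot be done per file by the compiler.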
Prerequisites: minimal understanding of:
- x86-64 or IA-32 assembly
- global structure of an ELF file. I have made a tutorial for that
Linking has nothing to do with C or C++ specifically: compilers just generate the object files. The linker then takes them as input without ever knowing what language compiled them. It might as well be Fortran.
So, to cut the cruft, let's study a NASM x86-64 ELF Linux hello world:
section .data
hello_world db "Hello world!", 10
section .text
global _start
_start:
; sys_write
mov rax, 1
mov rdi, 1
mov rsi, hello_world
mov rdx, 13
syscall
; sys_exit
mov rax, 60
mov rdi, 0
syscall
assembled and linked with:
nasm -f elf64 -o hello_world.o hello_world.asm
ld -o hello_world.out hello_world.o
with NASM 2.10.09.
1) .text of .o
First we disassemble the .text section of the object file:
objdump -d hello_world.o
which gives:
0000000000000000 <_start>:
0: b8 01 00 00 00 mov $0x1,%eax
5: bf 01 00 00 00 mov $0x1,%edi
a: 48 be 00 00 00 00 00 movabs $0x0,%rsi
11: 00 00 00
14: ba 0d 00 00 00 mov $0xd,%edx
19: 0f 05 syscall
1b: b8 3c 00 00 00 mov $0x3c,%eax
20: bf 00 00 00 00 mov $0x0,%edi
25: 0f 05 syscall
the crucial lines are:
a: 48 be 00 00 00 00 00 movabs $0x0,%rsi
11: 00 00 00
which should move the address of the hello world string into the rsi register, which is passed to the write system call.
But wait! How can the compiler (or here, the assembler) possibly know where "Hello world!" will end up in memory when the program is loaded?
Well, it can't, especially after we link a bunch of .o files together with multiple .data sections.
Only the linker can do that, since only it has all those object files.
So the compiler just:
- puts a placeholder value 0x0 in the compiled output
- gives the linker some extra information on how to patch the compiled code with the correct addresses
This "extra information" is contained in the .rela.text section of the object file.
2) .rela.text
.rela.text stands for "relocation of the .text section".
The word relocation is used because the linker will have to relocate the address from the object into the executable.
We can dump the .rela.text section with:
readelf -r hello_world.o
which contains:
Relocation section '.rela.text' at offset 0x340 contains 1 entries:
Offset Info Type Sym. Value Sym. Name + Addend
00000000000c 000200000001 R_X86_64_64 0000000000000000 .data + 0
The format of this section is fixed and documented at: http://www.sco.com/developers/gabi/2003-12-17/ch4.reloc.html
Each entry tells the linker about one address that needs to be relocated. Here we have only one, for the string.
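As a rough illustration of that record format, the Python sketch below decodes one 24-byte Elf64_Rela entry (offset, info, addend) and splits the Info field into its symbol index and relocation type. The raw bytes are rebuilt from the values readelf printed rather than read from the actual file, and parse_rela is a name made up for this example.

```python
import struct

R_X86_64_64 = 1  # relocation type number from the x86-64 psABI

def parse_rela(entry: bytes):
    """Decode one 24-byte little-endian Elf64_Rela record."""
    r_offset, r_info, r_addend = struct.unpack("<QQq", entry)
    sym_index = r_info >> 32          # what the ELF64_R_SYM macro computes
    rel_type = r_info & 0xFFFFFFFF    # what the ELF64_R_TYPE macro computes
    return r_offset, sym_index, rel_type, r_addend

# Rebuild the raw bytes of the entry from the fields readelf printed.
raw = struct.pack("<QQq", 0xC, 0x000200000001, 0)
offset, sym_index, rel_type, addend = parse_rela(raw)
print(hex(offset), sym_index, rel_type, addend)  # 0xc 2 1 0
```

So the Info value 000200000001 is just "symbol table entry 2, relocation type 1", which readelf renders as .data and R_X86_64_64.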
Simplifying a bit, for this particular line we have the following information:
- Offset = C: the first byte of the .text that this entry changes. If we look back at the disassembled text, it is exactly inside the critical movabs $0x0,%rsi, and those who know x86-64 instruction encoding will notice that this is where the 64-bit address part of the instruction starts.
- Name = .data: the address points to the .data section.
- Type = R_X86_64_64, which specifies exactly what calculation has to be done to translate the address. This field is actually processor dependent, and thus documented in the AMD64 System V ABI extension, section 4.4 "Relocation".
That document says that R_X86_64_64 does:
- Field = word64: 8 bytes, thus the 00 00 00 00 00 00 00 00 at address 0xC
- Calculation = S + A
  - S is the value of the symbol that the entry refers to, here the address of the .data section. In the object file it is still 0 (the Sym. Value column above); the linker fills in the real value once it has placed .data in the executable.
  - A is the addend, which is 0 here. This is a field of the relocation entry.
So S + A is the address of .data plus 0, and we will get relocated to the very first address of the .data section.
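The whole operation can be sketched in a few lines of Python. The bytes of .text are copied from the objdump listing of hello_world.o, and S is taken to be 0x6000d8, the address ld ends up assigning to .data in this particular build. apply_r_x86_64_64 is a hypothetical helper written for this example, not part of any real linker.

```python
# Bytes of .text from the objdump listing of hello_world.o.
text = bytearray.fromhex(
    "b801000000"            # mov    $0x1,%eax
    "bf01000000"            # mov    $0x1,%edi
    "48be0000000000000000"  # movabs $0x0,%rsi   <- placeholder at offset 0xc
    "ba0d000000"            # mov    $0xd,%edx
    "0f05"                  # syscall
    "b83c000000"            # mov    $0x3c,%eax
    "bf00000000"            # mov    $0x0,%edi
    "0f05"                  # syscall
)

def apply_r_x86_64_64(section, offset, symbol_value, addend):
    """Write S + A as a 64-bit little-endian word at `offset`."""
    value = (symbol_value + addend) & 0xFFFFFFFFFFFFFFFF
    section[offset:offset + 8] = value.to_bytes(8, "little")

data_address = 0x6000D8  # S: assumed final address of .data in this build
apply_r_x86_64_64(text, offset=0xC, symbol_value=data_address, addend=0)
print(text[0xA:0x14].hex(" "))  # 48 be d8 00 60 00 00 00 00 00
```

Every other byte of the section is left untouched; only the 8-byte field named by the relocation entry is rewritten.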
3) .text of .out
Now let's look at the text area of the executable that ld generated for us:
objdump -d hello_world.out
gives:
00000000004000b0 <_start>:
4000b0: b8 01 00 00 00 mov $0x1,%eax
4000b5: bf 01 00 00 00 mov $0x1,%edi
4000ba: 48 be d8 00 60 00 00 movabs $0x6000d8,%rsi
4000c1: 00 00 00
4000c4: ba 0d 00 00 00 mov $0xd,%edx
4000c9: 0f 05 syscall
4000cb: b8 3c 00 00 00 mov $0x3c,%eax
4000d0: bf 00 00 00 00 mov $0x0,%edi
4000d5: 0f 05 syscall
So, apart from the addresses in the left column, the only bytes that changed from the object file are in the critical lines:
4000ba: 48 be d8 00 60 00 00 movabs $0x6000d8,%rsi
4000c1: 00 00 00
which now point to the address 0x6000d8 (d8 00 60 00 00 00 00 00 in little-endian) instead of 0x0.
Is this the right location for the hello_world string?
To decide, we have to check the program headers, which tell Linux where to load each segment, and therefore each section.
We dump them with:
readelf -l hello_world.out
which gives:
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0x0000000000400000 0x0000000000400000
0x00000000000000d7 0x00000000000000d7 R E 200000
LOAD 0x00000000000000d8 0x00000000006000d8 0x00000000006000d8
0x000000000000000d 0x000000000000000d RW 200000
Section to Segment mapping:
Segment Sections...
00 .text
01 .data
This tells us that the .data section, which is in the second segment, starts at VirtAddr = 0x6000d8.
And the only thing in the data section is our hello world string.
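The same check can be written as a short Python sketch: decode the eight little-endian bytes that were patched into the movabs, then look for the LOAD segment whose address range contains the result. The segment table is typed in by hand from the readelf -l output, and find_segment is a name invented for this example.

```python
# (file offset, virtual address, file size, flags) from the `readelf -l` output.
segments = [
    (0x00, 0x400000, 0xD7, "R E"),
    (0xD8, 0x6000D8, 0x0D, "RW"),
]

def find_segment(vaddr):
    """Return (segment index, file offset, flags) for a virtual address."""
    for index, (file_off, seg_vaddr, size, flags) in enumerate(segments):
        if seg_vaddr <= vaddr < seg_vaddr + size:
            return index, file_off + (vaddr - seg_vaddr), flags
    return None

# The 8 bytes that the linker wrote into the movabs instruction.
relocated = int.from_bytes(bytes.fromhex("d800600000000000"), "little")
index, file_offset, flags = find_segment(relocated)
print(hex(relocated), index, hex(file_offset), flags)  # 0x6000d8 1 0xd8 RW
```

The address lands on the first byte of segment 01, which holds .data, and that segment is 0xd = 13 bytes long, matching the length of "Hello world!" plus the newline.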