Linux memory segmentation

Question

Looking into the internals of Linux and memory management, I just stumbled upon the segmented paging model that Linux uses.

Correct me if I am wrong, but Linux (in protected mode) uses paging to map a linear virtual address space to the physical address space. This linear address space, made up of pages, is split into four segments for the process's flat memory model, namely:

  • The kernel code segment (__KERNEL_CS);
  • The kernel data segment (__KERNEL_DS);
  • The user code segment (__USER_CS);
  • The user data segment (__USER_DS);

A fifth memory segment known as the Null segment is present but unused.

These segments have a CPL (Current Privilege Level) of either 0 (supervisor) or 3 (userland).

To keep it simple, I will concentrate on the 32-bit memory mapping, with a 4 GiB addressable space: 3 GiB for the userland process space (shown in green) and 1 GiB for the supervisor kernel space (shown in red).

So the red part consists of two segments __KERNEL_CS and __KERNEL_DS, and the green part of two segments __USER_CS and __USER_DS.

These segments overlap each other. Paging will be used for userland and kernel isolation.

However, to quote Wikipedia:

[...] many 32-bit operating systems simulate a flat memory model by setting all segments' bases to 0 in order to make segmentation neutral to programs.

Looking at the Linux kernel code for the GDT:

[GDT_ENTRY_KERNEL32_CS]       = GDT_ENTRY_INIT(0xc09b, 0, 0xfffff),
[GDT_ENTRY_KERNEL_CS]         = GDT_ENTRY_INIT(0xa09b, 0, 0xfffff),
[GDT_ENTRY_KERNEL_DS]         = GDT_ENTRY_INIT(0xc093, 0, 0xfffff),
[GDT_ENTRY_DEFAULT_USER32_CS] = GDT_ENTRY_INIT(0xc0fb, 0, 0xfffff),
[GDT_ENTRY_DEFAULT_USER_DS]   = GDT_ENTRY_INIT(0xc0f3, 0, 0xfffff),
[GDT_ENTRY_DEFAULT_USER_CS]   = GDT_ENTRY_INIT(0xa0fb, 0, 0xfffff),

As Peter pointed out, each segment begins at 0, but what are those flags, namely 0xc09b, 0xa09b and so on? I tend to believe they are the segment selectors. If not, how would I be able to access the userland segment from the kernel segment, if both their address spaces start at 0?

Segmentation is not used; only paging is. Segments have their seg_base addresses set to 0 and their limit set to 0xFFFFF, thus giving a full linear address space. That means that logical addresses are no different from linear addresses.

Also, since all segments overlap each other, is it the paging unit that provides memory protection (i.e. the memory separation)?

Paging provides protection, not segmentation. The kernel checks the linear address against a boundary (often known as TASK_SIZE) and checks the privilege level for the requested page.

Answer

Yes, Linux uses paging, so all addresses are always virtual. (To access memory at a known physical address, Linux keeps all physical memory 1:1 mapped to a range of kernel virtual address space, so it can simply index into that "array" using the physical address as the offset, modulo complications for 32-bit kernels on systems with more physical RAM than kernel address space.)
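The 1:1 mapping described above amounts to adding a constant offset. Here is a minimal sketch of that arithmetic, assuming a 32-bit kernel with the 3 GiB / 1 GiB split from the question. The names and the constant 0xC0000000 are illustrative assumptions, not copied from the kernel source.

```c
#include <stdint.h>

/* Assumed start of the kernel's direct map for a 3 GiB / 1 GiB split. */
#define PAGE_OFFSET_SKETCH 0xC0000000u

/* Kernel virtual address at which a given physical address is visible. */
uint32_t sketch_phys_to_virt(uint32_t phys)
{
    return phys + PAGE_OFFSET_SKETCH;
}

/* Inverse mapping; only meaningful for addresses inside the direct map. */
uint32_t sketch_virt_to_phys(uint32_t virt)
{
    return virt - PAGE_OFFSET_SKETCH;
}
```

With these assumptions, physical address 0x00100000 would be visible to the kernel at virtual address 0xC0100000.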

"This linear address space constituted of pages, is split into four segments"

No, Linux uses a flat memory model. The base and limit for all 4 of those segment descriptors are 0 and -1 (unlimited), i.e. they all fully overlap, covering the entire 32-bit virtual linear address space.
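To see why a base of 0 and a limit field of 0xfffff cover the whole 32-bit space, here is a small sketch of the descriptor arithmetic. The struct and function names are hypothetical, and the sketch assumes the granularity (G) bit is set, so the 20-bit limit counts 4 KiB units.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of a segment descriptor. */
struct seg_sketch {
    uint32_t base;        /* 32-bit base address */
    uint32_t limit;       /* 20-bit limit field */
    bool     granularity; /* G bit: limit counts 4 KiB units if set */
};

/* Highest valid byte offset within the segment. */
uint32_t seg_last_offset(const struct seg_sketch *s)
{
    return s->granularity ? ((s->limit << 12) | 0xFFFu) : s->limit;
}

/* Logical (segment:offset) to linear address; wraps modulo 2^32. */
uint32_t seg_linear(const struct seg_sketch *s, uint32_t offset)
{
    return s->base + offset;
}
```

For a segment with base 0, limit 0xFFFFF and G set, the last valid offset is 0xFFFFFFFF and every offset maps to the identical linear address, which is what "flat" means here.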

"So the red part consists of two segments __KERNEL_CS and __KERNEL_DS"

No, this is where you went wrong. x86 segment registers are not used for segmentation; they're x86 legacy baggage that's only used for CPU mode and privilege-level selection on x86-64. Instead of adding new mechanisms for that and dropping segments entirely for long mode, AMD just neutered segmentation in long mode (base fixed at 0, as everyone already did in 32-bit mode anyway) and kept using segments only for machine-configuration purposes. Those are not particularly interesting unless you're actually writing code that switches to 32-bit mode or similar.

(Except that you can set a non-zero base for FS and/or GS, and Linux does so for thread-local storage. But this has nothing to do with how copy_from_user() is implemented. It only has to check the pointer value itself, without reference to any segment or to the CPL / RPL of a segment descriptor.)
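A pointer-value check of this kind can be sketched as a plain range comparison. This is an illustration in the spirit of such a check, not the kernel's actual implementation. The function name is hypothetical and the boundary 0xC0000000 assumes the 3 GiB / 1 GiB split from the question.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed user/kernel boundary for a 3 GiB / 1 GiB split. */
#define USER_LIMIT_SKETCH ((uintptr_t)0xC0000000u)

/*
 * True if [addr, addr + len) lies entirely below the boundary.
 * Written so that addr + len is never computed and cannot overflow.
 */
bool user_range_ok(uintptr_t addr, size_t len)
{
    return len <= USER_LIMIT_SKETCH && addr <= USER_LIMIT_SKETCH - len;
}
```

No segment register or descriptor is consulted: the decision depends only on the numeric value of the pointer and the length.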

In 32-bit legacy mode, it is possible to write a kernel that uses a segmented memory model, but none of the mainstream OSes actually did that. Some people wish that had become a thing, though, e.g. see this answer lamenting x86-64 making a Multics-style OS impossible. But this is not how Linux works.

Linux is a higher-half kernel (https://wiki.osdev.org/Higher_Half_Kernel), where kernel pointers have one range of values (the red part) and user-space addresses are in the green part. The kernel can simply dereference user-space addresses if the right user-space page tables are mapped. It doesn't need to translate them or do anything with segments; this is what it means to have a flat memory model. (The kernel can use "user" page-table entries, but not vice versa.) For x86-64 specifically, see https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt for the actual memory map.
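The asymmetry between "user" and supervisor page-table entries can be modelled with the User/Supervisor bit of an x86 page-table entry. The sketch below is deliberately simplified: it looks at a single entry, ignores write protection, NX and features such as SMAP/SMEP, and uses a hypothetical function name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Low flag bits of an x86 page-table entry. */
#define PTE_PRESENT 0x1u  /* bit 0: page is mapped */
#define PTE_WRITE   0x2u  /* bit 1: writable (not checked in this sketch) */
#define PTE_USER    0x4u  /* bit 2: accessible from ring 3 */

/*
 * Simplified privilege check for one page access.
 * cpl is 0 for kernel code and 3 for user code.
 */
bool page_access_allowed(uint32_t pte, unsigned cpl)
{
    if (!(pte & PTE_PRESENT))
        return false;   /* would raise a page fault */
    if (cpl == 3 && !(pte & PTE_USER))
        return false;   /* user code touching a supervisor page */
    return true;        /* supervisor code may touch either kind */
}
```

In this model, ring 0 can access both kinds of pages while ring 3 is limited to pages marked as user, which is the isolation that paging provides in place of segmentation.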


The only reasons those 4 GDT entries all need to be separate are privilege levels and the fact that data and code segment descriptors have different formats. (A GDT entry contains more than just the base and limit; the other fields are the parts that need to differ. See https://wiki.osdev.org/Global_Descriptor_Table)

See especially https://wiki.osdev.org/Segmentation#Notes_Regarding_C, which describes how and why the GDT is typically used by a "normal" OS to create a flat memory model, with a pair of code and data descriptors for each privilege level.

For a 32-bit Linux kernel, only gs gets a non-zero base, for thread-local storage, so an addressing mode like [gs: 0x10] accesses a linear address that depends on the thread executing it. In a 64-bit kernel (and 64-bit user-space), Linux uses fs instead, because x86-64 made GS special with the swapgs instruction, which is intended for use with syscall so the kernel can find its kernel stack.

But anyway, the non-zero base for FS or GS does not come from a GDT entry; it is set with the wrfsbase / wrgsbase instructions (or, on CPUs that don't support them, with a write to an MSR).
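The effect of a per-thread segment base can be shown with a toy model: the same instruction operand, [gs:0x10], resolves to a different linear address in each thread because each thread has its own base. The struct, the function name and the base values used with it are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical per-thread state: only the GS base matters here. */
struct thread_sketch {
    uint32_t gs_base; /* loaded for the thread when it is scheduled */
};

/* Linear address touched by an access of the form [gs:offset]. */
uint32_t gs_relative(const struct thread_sketch *t, uint32_t offset)
{
    return t->gs_base + offset;
}
```

Two threads whose bases are, say, 0xB7F00000 and 0xB7E00000 would reach 0xB7F00010 and 0xB7E00010 respectively for the same offset 0x10, which is how thread-local variables get distinct storage.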


"but what are those flags, namely 0xc09b, 0xa09b and so on? I tend to believe they are the segments selectors"

No, segment selectors are indices into the GDT. The kernel is defining the GDT as a C array, using designated-initializer syntax like [GDT_ENTRY_KERNEL32_CS] = initializer_for_that_selector.

(Actually, the low 2 bits of a selector, i.e. of a segment register value, hold a privilege level, and bit 2 selects between the GDT and the LDT. So GDT_ENTRY_DEFAULT_USER_CS should be __USER_CS >> 3.)

A mov ds, eax makes the hardware index the GDT; it does not linearly search memory for matching data!
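The selector layout can be made concrete with a few helpers: bits 0-1 hold the requested privilege level, bit 2 is the table indicator (0 for the GDT, 1 for the LDT), and the remaining bits are the table index. The function names are hypothetical. The example values (index 6, selector 0x33) are what I would expect for the 64-bit user code segment on x86-64 Linux, but treat them as an assumption to check against the kernel headers.

```c
#include <stdint.h>

/* Build a selector from a table index, table indicator and RPL. */
uint16_t make_selector(unsigned index, unsigned ti, unsigned rpl)
{
    return (uint16_t)((index << 3) | ((ti & 1u) << 2) | (rpl & 3u));
}

/* Index into the GDT or LDT. */
unsigned selector_index(uint16_t sel) { return sel >> 3; }

/* Table indicator: 0 = GDT, 1 = LDT. */
unsigned selector_ti(uint16_t sel) { return (sel >> 2) & 1u; }

/* Requested privilege level. */
unsigned selector_rpl(uint16_t sel) { return sel & 3u; }
```

Under that assumption, make_selector(6, 0, 3) yields 0x33, and selector_index(0x33) gives back 6, which the hardware uses directly as an array index into the GDT.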

GDT data format:

You're looking at x86-64 Linux source code, so the kernel will be in long mode, not protected mode. We can tell because there are separate entries for USER_CS and USER32_CS. The 32-bit code segment descriptor will have its L bit cleared. The current CS segment descriptor is what puts an x86-64 CPU into 32-bit compat mode vs. 64-bit long mode. To enter 32-bit user-space, an iret or sysret will set CS:RIP to a user-mode 32-bit segment selector.

I think you can also have the CPU in 16-bit compat mode (still compat mode, not real mode, but with a default operand size and address size of 16 bits). Linux doesn't do this, though.
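How the code-segment descriptor selects the sub-mode while long mode is active can be summarised as a small decision function over the L bit and the D (default size) bit. The enum and function names are hypothetical, and the mapping follows my reading of the architecture manuals.

```c
#include <stdbool.h>

enum cs_mode_sketch {
    CS_MODE_16BIT,    /* L=0, D=0: compat mode with 16-bit defaults */
    CS_MODE_32BIT,    /* L=0, D=1: 32-bit compat mode */
    CS_MODE_64BIT,    /* L=1, D=0: 64-bit mode */
    CS_MODE_RESERVED  /* L=1, D=1: reserved combination */
};

/* Sub-mode selected by the CS descriptor while long mode is active. */
enum cs_mode_sketch cs_mode(bool l_bit, bool d_bit)
{
    if (l_bit)
        return d_bit ? CS_MODE_RESERVED : CS_MODE_64BIT;
    return d_bit ? CS_MODE_32BIT : CS_MODE_16BIT;
}
```

Switching between 64-bit and 32-bit user-space is therefore just a matter of loading CS with a selector whose descriptor has a different L/D combination.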

Anyway, as explained in https://wiki.osdev.org/Global_Descriptor_Table and the Segmentation page linked above,

Each segment descriptor contains the following information:

  • The base address of the segment
  • The default operation size in the segment (16-bit/32-bit)
  • The privilege level of the descriptor (Ring 0 -> Ring 3)
  • The granularity (segment limit is in byte or 4 KiB units)
  • The segment limit (The maximum legal offset within the segment)
  • The segment presence (Is it present or not)
  • The descriptor type (0 = system; 1 = code/data)
  • The segment type (Code/Data/Read/Write/Accessed/Conforming/Non-Conforming/Expand-Up/Expand-Down)

These are the extra bits. I'm not particularly interested in which bits are which because I (think I) understand the high level picture of what different GDT entries are for and what they do, without getting into the details of how that's actually encoded.

But if you check the x86 manuals or the osdev wiki, and the definitions for those init macros, you should find that they result in a GDT entry with the L bit set for 64-bit code segments, cleared for 32-bit code segments. And obviously the type (code vs. data) and privilege level differ.
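For readers who do want to see the bits, the flag words from the question can be unpacked as follows. This is a sketch that assumes the flags argument of GDT_ENTRY_INIT carries the access byte in bits 0-7 and the AVL/L/D/G nibble in bits 12-15. That matches my reading of the macro but should be verified against the kernel source. The struct and function names are hypothetical.

```c
#include <stdint.h>

/* Fields packed into the 16-bit flags argument (layout is an assumption). */
struct gdt_flags_sketch {
    unsigned type; /* bits 0-3: segment type (bit 3 set means code) */
    unsigned s;    /* bit 4: 1 = code/data, 0 = system */
    unsigned dpl;  /* bits 5-6: descriptor privilege level */
    unsigned p;    /* bit 7: present */
    unsigned avl;  /* bit 12: available to software */
    unsigned l;    /* bit 13: 64-bit code segment */
    unsigned d;    /* bit 14: 32-bit default operand size */
    unsigned g;    /* bit 15: limit counted in 4 KiB units */
};

/* Split a flags word such as 0xa09b into its individual fields. */
struct gdt_flags_sketch decode_gdt_flags(uint16_t flags)
{
    struct gdt_flags_sketch f;
    f.type = flags & 0x0Fu;
    f.s    = (flags >> 4) & 1u;
    f.dpl  = (flags >> 5) & 3u;
    f.p    = (flags >> 7) & 1u;
    f.avl  = (flags >> 12) & 1u;
    f.l    = (flags >> 13) & 1u;
    f.d    = (flags >> 14) & 1u;
    f.g    = (flags >> 15) & 1u;
    return f;
}
```

Under this layout, 0xa09b decodes to a present ring-0 code segment with L set and D clear (64-bit), 0xc09b is the same with L clear and D set (32-bit), and 0xc0f3 is a present ring-3 data segment. All of them have G set, so the 0xfffff limit counts 4 KiB units.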
