Linux kernel page table update

Question

In Linux x86 paging:

  1. Each process has its own page directory.

  2. The page table walk starts at the page directory pointed to by CR3.

  3. Every process shares the kernel page directory content.

Assuming these three statements are correct, suppose a process enters kernel mode and updates its kernel page directory content (address mappings, access rights, etc.).

Question: since the kernel address space is globally shared among processes, this update has to be synchronized with the other processes' page directories, right?

How is this managed?

Thank you in advance.

Answer

I don't know about Linux, so I'll answer for Windows. Some of the kernel space is 'global': the global flag is set in the PTE to indicate that the mapping is used by more than one process. The type given in the register operand of the INVPCID instruction selects whether these global entries are included in or excluded from a TLB invalidation. These page table entries are shared between processes and appear at the same place in each process's page tables. This way, only a single PTE needs to be updated, and the PTEs of other processes do not need to be synchronised, because all processes share one PTE at one physical address.
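The sharing described above can be pictured with a toy model. This is only a sketch: the names `kernel_tables` and `new_page_directory`, the 2 GB / 2 GB split, and the PTE value are assumptions made for illustration, not Windows or Linux internals.

```python
# Toy model: a page directory is a list of references to page tables,
# and a page table is a list of PTE values. All names are hypothetical.

ENTRIES = 1024           # entries per table, 32-bit x86 without PAE
KERNEL_PDE_START = 512   # assumed 2 GB / 2 GB user/kernel split

# One set of kernel page tables, allocated once and shared by every process.
kernel_tables = [[0] * ENTRIES for _ in range(ENTRIES - KERNEL_PDE_START)]

def new_page_directory():
    """Build a per-process directory: private user half, shared kernel half."""
    user_half = [None] * KERNEL_PDE_START   # filled in per process
    kernel_half = list(kernel_tables)        # copies references, not the tables
    return user_half + kernel_half

proc_a = new_page_directory()
proc_b = new_page_directory()

# "Process A" is in kernel mode and changes one kernel PTE (made-up value).
proc_a[600][5] = 0x12345003

# Process B observes the change without any synchronisation step.
print(hex(proc_b[600][5]))
print(proc_a[600] is proc_b[600])  # True: both PDEs reference one table
```

In this model only the directories are per process. A change inside a shared table is seen everywhere at once, while a change to a kernel PDE itself would have to be repeated in every directory.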

http://www.cs.miami.edu/home/burt/journal/NT/memory.html

Some kernel memory is not visible to all processes and is private to each process (it is still ring 0 memory). On a 32-bit Windows system this is 0xC0000000–0xC0200000, which contains all the user-space PTEs, where 0xC0000000 is PTE_BASE. This allows the macros #define MiGetPteAddress(x) ((PMMPTE)(((((ULONG)(x)) >> 12) << 2) + (ULONG_PTR)MmPteBase)) and #define MiAddressToPte(x) MiGetPteAddress(x) to work elegantly for converting the faulting virtual address in cr2 to the address of its PTE. This region is private to each process because every process uses the same PTE base address; if it were visible to all processes it would quickly take up virtual memory, as each process's set of page tables would have to be allocated sequentially. It doesn't need to be visible to all processes because a process has no interest in the page table entries of another process. A page fault is always handled in the context of the current process, and 0xC0000000–0xC0200000 means something different in each process context.
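The arithmetic in the macro can be tried out directly. This minimal sketch restates it in Python, assuming 32-bit non-PAE paging (4 KB pages, 4-byte PTEs) and the PTE_BASE value given above; `pte_address` is an illustrative name, not a Windows function.

```python
PTE_BASE = 0xC0000000   # value given in the text for 32-bit Windows
PAGE_SHIFT = 12         # 4 KB pages
PTE_SHIFT = 2           # 4-byte PTEs (non-PAE)

def pte_address(va):
    """Virtual address of the PTE that maps `va` (same steps as the macro)."""
    page_number = (va & 0xFFFFFFFF) >> PAGE_SHIFT
    return PTE_BASE + (page_number << PTE_SHIFT)

for va in (0x00000000, 0x7FFFF000, 0x80000000, 0xFFFFF000):
    print(hex(va), "->", hex(pte_address(va)))
```

The first and last user pages land at the two ends of 0xC0000000–0xC0200000, which is why that window holds exactly the user-space PTEs.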

The kernel space 0xC0200000–0xC0400000, which holds the kernel PTEs (those for kernel addresses), would however be global and shared by all processes, except for the section within it that maps 0xC0000000–0xC0200000 itself. By my calculation that section is 0xC0300000–0xC0300800, which is the user-mode half of the PDEs, since the PDEs occupy 0xC0300000–0xC0300FFF (PDE_BASE = 0xC0300000).
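The ranges quoted above can be checked with a few lines of arithmetic. This is a sketch under the same 32-bit non-PAE assumptions, with a 2 GB user space; `pte_span` and `pde_address` are hypothetical helper names.

```python
PTE_BASE = 0xC0000000
PDE_BASE = 0xC0300000

def pte_address(va):
    return PTE_BASE + (va >> 12) * 4   # one 4-byte PTE per 4 KB page

def pde_address(va):
    return PDE_BASE + (va >> 22) * 4   # one 4-byte PDE per 4 MB region

def pte_span(start, end):
    """Half-open range of PTE addresses that map [start, end)."""
    return pte_address(start), pte_address(end)

user_ptes = pte_span(0x00000000, 0x80000000)
kernel_ptes = pte_span(0x80000000, 0x100000000)
ptes_of_user_ptes = pte_span(0xC0000000, 0xC0200000)

print([hex(a) for a in user_ptes])          # where the user-space PTEs live
print([hex(a) for a in kernel_ptes])        # where the kernel PTEs live
print([hex(a) for a in ptes_of_user_ptes])  # PTEs mapping the user PTE window
print(hex(pde_address(0x80000000)))         # first kernel PDE
```

The third range coincides with the PDEs for user addresses, which is the overlap the paragraph points out.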

It is however impossible to split the PDE region so that the user PDEs are private and the kernel PDEs are global (i.e. to make 0xC0300000–0xC0300800 private, pointing to different physical addresses, while 0xC0300800–0xC0300FFF points to the same physical address in every process). The whole PDE region (0xC0300000–0xC0300FFF) lies in a single physical frame, the one pointed to by cr3, and cr3 is different for each process. This means the whole PDE region (all PDEs) has to be private per process, with the kernel PDEs duplicated and installed in each process. If a kernel page table page (a page containing a kernel page table) were paged out and back in to a new physical location, the PDEs would all have to be synchronised, because every process holds its own copy at a different cr3 physical address rather than sharing one physical PDE. I'm not sure how Windows does this (efficiently) at the moment, so it would be wise to impose the restriction that kernel page tables cannot be paged out and must live in non-paged pool; that way the kernel PDEs remain constant across all CR3 pages. On 64-bit, the restriction could be that kernel PDPTs can't be paged out.

On 32-bit Windows, a process starts with a physical CR3 page whose PDE at offset 1100000000 (base 2) * 4 bytes points to the page itself. This entry is hard-written, probably by briefly turning off paging in cr0 (a write to the recursive entry through its virtual address cannot succeed until that entry exists, which creates a paradox). Notice that through this self-referencing PDE the page directory acts as the page table covering the range 0xC0000000–0xC0400000, i.e. it points to 1023 page tables and 1 page directory (itself) (2^10 entries), and hence allows the PTEs to be modified through their virtual addresses. The CR3 page appears at 0xC0300000 because that address has the same page directory index and page table index, 1100000000 and 1100000000, so the walk loops back on itself twice and yields the CR3 page, letting you modify the PDEs by address (other addresses have this repeated-index pattern too, e.g. 0xE0380000). After this is set up, the appropriate kernel mappings are made.

On 64-bit Windows it would be similar: a process is set up with a single PML4 table page that points to itself, and any PML4E, PDPTE, PDE or PTE can then be filled in and accessed by varying the number of loopbacks. On 64-bit Windows, when a process is terminated, all the physical pages of the process are moved to the free list, which would include all user physical PDPT pages, PD pages, PT pages and the PML4/CR3 page. The kernel ones would not be marked for the free list.
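The loopback behaviour can be seen by simulating a two-level walk. This is a simplified sketch: table entries hold bare frame numbers with no flag bits, and the frame numbers (0x100, 0x200, 0x555) are arbitrary values chosen for illustration.

```python
SELF_INDEX = 0b1100000000   # 0x300, the self-referencing PDE slot from the text

def walk(phys, cr3, va):
    """Return the frame a 32-bit non-PAE walk of `va` ends on."""
    pd_index = (va >> 22) & 0x3FF
    pt_index = (va >> 12) & 0x3FF
    pt_frame = phys[cr3][pd_index]    # level 1: read the PDE
    return phys[pt_frame][pt_index]   # level 2: read the PTE

# Physical memory: frame number -> table of 1024 entries (frame numbers).
CR3 = 0x100
phys = {CR3: [None] * 1024, 0x200: [None] * 1024}
phys[CR3][0] = 0x200            # PDE 0 -> page table in frame 0x200
phys[CR3][SELF_INDEX] = CR3     # recursive entry: the directory maps itself
phys[0x200][1] = 0x555          # PTE: VA 0x00001000 -> frame 0x555

print(hex(walk(phys, CR3, 0x00001000)))  # ordinary walk reaches the data frame
print(hex(walk(phys, CR3, 0xC0000000)))  # one loopback reaches the page table
print(hex(walk(phys, CR3, 0xC0300000)))  # two loopbacks reach the directory
```

With one pass through the recursive entry the hardware treats the directory as a page table, so the walk ends one level early. With two passes it ends on the directory itself.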

In general, if you know which entry in the PML4 is the recursive entry pointing to the physical PML4 page, you can work out the virtual address of the PTE for any given virtual address. You prepend the index of the recursive entry (10 bits on 32-bit, 9 bits on 64-bit) to the virtual address (this is what adding 0xC0000000 does in the 32-bit macro earlier), drop the last 12 bits, and then scale the page table index now at the end of the address back up to a byte offset by multiplying it by 8 (or 4 for 32-bit entries); hence the right shift by 12 and the left shift by 3 (or 2). One loopback removes one layer of indirection and gives you the virtual address of the PTE; two loopbacks give you the virtual address of the PDE, and so on. PTE_BASE on 32-bit Windows is the index 1100000000 left-shifted to fill 32 bits, and PDE_BASE is 11000000001100000000 left-shifted to fill 32 bits. These bases are used in the macros, and any virtual address with one of these prefixes is by definition the address of part of a PTE or a PDE respectively. Windows chooses index 1100000000 for the page table hierarchy on 32-bit, but in principle it could be any of the 2^10 page directory indexes (or any of the 2^9 PML4 indexes on 64-bit).
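The general rule can be written as two small functions. This is a sketch, not Windows code: `recursive_base` and `entry_address` are hypothetical names, and the 64-bit index 0x1ED is an assumption used only for illustration, since the text does not say which PML4 slot is used.

```python
def recursive_base(self_index, loops, levels=4, bits=9, width=64):
    """Base virtual address reached after `loops` passes through the self-entry."""
    va_bits = 12 + bits * levels
    base = 0
    for i in range(loops):
        # Place the recursive index in the i-th index field from the top.
        base |= self_index << (va_bits - bits * (i + 1))
    # Sign-extend to a canonical address when the VA is narrower than the word.
    if width > va_bits and base & (1 << (va_bits - 1)):
        base |= ((1 << width) - 1) ^ ((1 << va_bits) - 1)
    return base

def entry_address(va, self_index, loops, levels=4, bits=9, width=64):
    """Virtual address of the entry for `va`: loops=1 is the PTE, 2 the PDE, ..."""
    va_bits = 12 + bits * levels
    entry_size = width // 8   # 8-byte entries on 64-bit, 4-byte on 32-bit non-PAE
    index = (va & ((1 << va_bits) - 1)) >> (12 + bits * (loops - 1))
    base = recursive_base(self_index, loops, levels, bits, width)
    return base + index * entry_size

# 32-bit layout from the text: 2 levels, 10-bit indexes, self-index 0x300.
x86 = dict(levels=2, bits=10, width=32)
print(hex(recursive_base(0x300, 1, **x86)))   # PTE_BASE
print(hex(recursive_base(0x300, 2, **x86)))   # PDE_BASE

# 64-bit layout with the assumed self-index 0x1ED.
print(hex(recursive_base(0x1ED, 1)))
print(hex(entry_address(0xFFFFF80000000000, 0x1ED, 1)))
```

For the 32-bit parameters the two bases come out as 0xC0000000 and 0xC0300000, matching the PTE_BASE and PDE_BASE values quoted in the text.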

KAISER, or KPTI, designed to mitigate Meltdown, most likely has 2 cr3s for each process. The restricted cr3 used in user mode would contain a single kernel PML4E, enough to make a preliminary interrupt dispatch routine accessible. Upon trapping to the kernel, that routine performs the swap, replacing the restricted cr3 with the full cr3 containing all kernel PML4Es.
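The two-cr3 idea can be modelled very roughly as follows. This is a toy sketch rather than how any kernel implements it: top-level tables are plain dictionaries, and `TRAMPOLINE_SLOT`, `build_tables` and `Cpu` are hypothetical names.

```python
KERNEL_SLOTS = range(256, 512)   # upper half of a 512-entry PML4
TRAMPOLINE_SLOT = 511            # hypothetical slot holding the entry stub

def build_tables(user_entries, kernel_entries):
    """Return (full, restricted) top-level tables for one process."""
    full = {**user_entries, **kernel_entries}
    restricted = dict(user_entries)
    restricted[TRAMPOLINE_SLOT] = kernel_entries[TRAMPOLINE_SLOT]
    return full, restricted

class Cpu:
    def __init__(self, full, restricted):
        self.full, self.restricted = full, restricted
        self.cr3 = restricted            # running in user mode

    def enter_kernel(self):
        self.cr3 = self.full             # the entry stub swaps in the full table

    def return_to_user(self):
        self.cr3 = self.restricted       # swap back before returning

    def can_translate(self, slot):
        return slot in self.cr3

kernel = {slot: f"kernel-pdpt-{slot}" for slot in KERNEL_SLOTS}
cpu = Cpu(*build_tables({0: "user-pdpt-0"}, kernel))

print(cpu.can_translate(300))   # False: kernel slot hidden in user mode
cpu.enter_kernel()
print(cpu.can_translate(300))   # True: full table is active
```

In user mode only the trampoline slot of the kernel half is reachable, so a speculative read of other kernel addresses has no translation to follow.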

As for physical memory on Windows, see here: https://superuser.com/a/1549970/933117
