Is it true that modern OSes may skip the copy when realloc is called?

Question

While reading https://stackoverflow.com/a/3190489/196561 I had a question. The Qt authors say in Inside the Qt 4 Containers:

... QVector uses realloc() to grow by increments of 4096 bytes. This makes sense because modern operating systems don't copy the entire data when reallocating a buffer; the physical memory pages are simply reordered, and only the data on the first and last pages needs to be copied.

My questions are:

1) Is it true that modern OSes (Linux is the most interesting to me; also FreeBSD, OSX, Windows) and their realloc implementations are really capable of reallocating pages of data by reordering the virtual-to-physical mapping, without a byte-by-byte copy?

2) What system call is used to achieve this memory move? (I think it could be splice with SPLICE_F_MOVE, but that was flawed and is a no-op now(?))

3) Is it profitable to use such page shuffling instead of a byte-by-byte copy, especially in a multicore, multithreaded world, where every change of the virtual-to-physical mapping needs to flush (invalidate) the changed page table entries from the TLBs (http://en.wikipedia.org/wiki/Translation_lookaside_buffer) in all tens of CPU cores with an IPI? (In Linux this is something like flush_tlb_range or flush_tlb_page.)

Update for Q3: some tests of mremap vs. memcpy

Accepted answer

1) Is it true that modern OSes (Linux is the most interesting to me; also FreeBSD, OSX, Windows) and their realloc implementations are really capable of reallocating pages of data by reordering the virtual-to-physical mapping, without a byte-by-byte copy?

2) What system call is used to achieve this memory move? (I think it could be splice with SPLICE_F_MOVE, but that was flawed and is a no-op now(?))

See thejh's answer.

There are at least three actors in your Qt example:

  1. The Qt vector class
  2. glibc's realloc()
  3. Linux's mremap

QVector::capacity() shows that Qt allocates more elements than required. This means that a typical addition of an element will not realloc() anything. The glibc allocator is based on Doug Lea's allocator. That is a binning allocator which supports the use of Linux's mremap. A binning allocator groups similar-sized allocations in bins, so a typical random-sized allocation will still have some room to grow without needing to call into the system. I.e., the free pool or slack is located at the end of the allocated memory.
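
To see which path glibc actually takes, here is a minimal sketch (my addition, not part of the original answer): it grows a large allocation and reports whether the block moved. Run it on Linux under strace -e trace=mmap,mremap to see whether glibc satisfied the realloc() with mremap(); glibc only services chunks above its mmap threshold (128 KB by default, adjusted dynamically) with mmap/mremap, and the sizes below are arbitrary.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t sz = 4 * 1024 * 1024;       /* large enough to be an mmap'd chunk */
    char *p = malloc(sz);
    if (!p) return 1;
    memset(p, 1, sz);                  /* touch the pages */
    uintptr_t old = (uintptr_t)p;      /* save the value; p is invalid after realloc */

    char *q = realloc(p, 8 * sz);      /* grow 4 MB -> 32 MB */
    if (!q) return 1;
    printf("old=%#lx new=%#lx %s\n",
           (unsigned long)old, (unsigned long)(uintptr_t)q,
           (uintptr_t)q == old ? "(extended in place)"
                               : "(address changed: pages remapped or copied)");
    free(q);
    return 0;
}

If the chunk was mmap'd, the trace typically shows an mremap() call rather than any byte-by-byte copy.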

3) Is it profitable to use such page shuffling instead of a byte-by-byte copy, especially in a multicore, multithreaded world, where every change of the virtual-to-physical mapping needs to flush (invalidate) the changed page table entries from the TLBs in all tens of CPU cores with an IPI? (In Linux this is something like flush_tlb_range or flush_tlb_page.)

First off, the "faster ... than mremap" test is mis-using mremap(), as R notes there.

There are several things that make mremap() valuable as a primitive for realloc():

  1. Reduced memory consumption.
  2. Preserving the page mappings.
  3. Avoiding moving the data.

Everything in this answer is based upon Linux's implementation, but the semantics can be transferred to other OSes.

Consider a naive realloc():

void *realloc(void *ptr, size_t size)
{
    size_t old_size = get_sz(ptr);     /* allocator internal: from bin, address, map table, etc. */
    if (size <= old_size) {
        resize(ptr, size);             /* allocator internal: shrink bookkeeping in place */
        return ptr;
    }
    void *new_p = malloc(size);
    if (new_p) {
        memcpy(new_p, ptr, old_size);  /* old_size + new size fully committed at once */
        free(ptr);
    }
    return new_p;
}

In order to support this, you may need double the memory of the realloc() (the old and the new block are committed at once) before you go to swap, or the reallocation simply fails.

Linux will by default map new allocations to the zero page: a 4k page full of zero data. This is useful for sparsely mapped data structures. If no one writes to a data page, then no physical memory is allocated for it besides a possible PTE table. These pages are copy-on-write (COW). With the naive realloc(), these mappings are not preserved, and full physical memory is allocated for all of the zero pages.
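
The effect is easy to measure. The following Linux-only sketch (my addition; it assumes the /proc/self/statm format of size and resident pages) maps 64 MB of anonymous memory and prints the resident set size before and after writing to it. The physical pages only appear once they are written:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static long rss_pages(void)           /* resident set size, in pages */
{
    long size = 0, rss = 0;
    FILE *f = fopen("/proc/self/statm", "r");
    if (f) {
        fscanf(f, "%ld %ld", &size, &rss);
        fclose(f);
    }
    return rss;
}

int main(void)
{
    size_t len = 64 * 1024 * 1024;    /* 64 MB = 16384 pages of 4 KB */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;
    printf("RSS after mmap:  %ld pages\n", rss_pages()); /* barely moves */
    memset(p, 1, len);                /* write faults allocate real pages */
    printf("RSS after write: %ld pages\n", rss_pages()); /* ~16384 pages more */
    munmap(p, len);
    return 0;
}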

If the task is involved in a fork(), the initial realloc() data may be shared between parent and child. Again, COW defers the physical allocation of the pages. The naive implementation discounts this and requires separate physical memory per process.

If the system is under memory pressure, the existing realloc() pages may not be in physical memory but in swap. The naive realloc will cause disk reads to bring the swapped pages into memory, a copy to the updated location, and then likely a write of that data back out to disk.

The issue you raise of updating TLBs is minimal compared to touching the data. A single TLB entry is typically 4 bytes and represents a page (4K) of physical data. If you flush the entire TLB for a 4GB system, that is only 4MB of page-table data that needs to be restored. Copying large amounts of data will blow the L1 and L2 caches. TLB fetches also naturally pipeline better than d-cache and i-cache fetches. It is rare for code to take two TLB misses in a row, as most code is sequential.
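
Spelled out (assuming 4 KB pages and 4-byte table entries, as above):

    4 GB / 4 KB per page = 1,048,576 pages
    1,048,576 entries × 4 bytes per entry ≈ 4 MB of page-table data to re-fetch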

CPU caches come in two variants: VIVT (non-x86) and VIPT, as on x86. The VIVT versions usually have mechanisms to invalidate single TLB entries. For a VIPT system, the caches do not need to be invalidated, because they are physically tagged.

On a multi-core system, it is atypical for one process to run on all cores. Only the cores on which the process performing the mremap() is current need page table updates. When a process is migrated to a core (a typical context switch), its page table has to be migrated anyway.

You can construct some pathological cases where a naive copy would work better. But as Linux (and most OSes) multi-task, more than one process will be running. Also, the worst case is when swapping, and the naive implementation is always worse there (unless you have a disk faster than memory). For minimal realloc() sizes, dlmalloc or QVector should keep fallow space so that a system-level mremap() is not needed. A typical mremap() may just expand the virtual address range by growing the region with a random page from the free pool. It is only when the virtual address range must move that mremap() may need a TLB flush (see the sketch after this list), and then with all of the following being true:

  1. The realloc() memory should not be shared with a parent or child process.
  2. The memory should not be sparse (mostly zero or untouched).
  3. The system should not be under memory pressure using swap.
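
For reference, here is a minimal sketch of the underlying primitive (my addition; Linux-specific, requires _GNU_SOURCE). mremap() grows an anonymous mapping, and with MREMAP_MAYMOVE the kernel is allowed to relocate the virtual range by rewriting page tables rather than copying page contents:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t old_len = 16 * 4096, new_len = 4096 * 4096;  /* 64 KB -> 16 MB */
    char *p = mmap(NULL, old_len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;
    memset(p, 0xAA, old_len);         /* put recognizable data in place */

    /* The kernel extends the range in place if it can; otherwise it moves
       the mapping by remapping pages, not by copying bytes. */
    char *q = mremap(p, old_len, new_len, MREMAP_MAYMOVE);
    if (q == MAP_FAILED) return 1;
    printf("%s, data intact: %s\n",
           q == p ? "grown in place" : "moved by remapping",
           q[old_len - 1] == (char)0xAA ? "yes" : "no");
    munmap(q, new_len);
    return 0;
}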

The TLB flush and IPI need to take place only if the same process is current on other cores. L1-cache loading is not needed for the mremap(), but it is for the naive version. The L2 is usually shared between cores and will be current in all cases. The naive version forces the L2 to reload. The mremap() might leave some unused data out of the L2 cache; this is normally a good thing, but could be a disadvantage under some workloads. There are probably better ways to do this, such as pre-fetching the data.
