Is it true that modern OSes may skip the copy when realloc is called?


Question

While reading https://stackoverflow.com/a/3190489/196561, I have a question about what the Qt authors say in Inside the Qt 4 Containers:

... QVector uses realloc() to grow by increments of 4096 bytes. This makes sense because modern operating systems don't copy the entire data when reallocating a buffer; the physical memory pages are simply reordered, and only the data on the first and last pages need to be copied.

My questions are:

1) Is it true that modern OSes (Linux is the most interesting to me; also FreeBSD, OSX, Windows) and their realloc implementations are really capable of reallocating pages of data by reordering the virtual-to-physical mapping, without a byte-by-byte copy?

2) What system call is used to achieve this memory move? (I think it could be splice with SPLICE_F_MOVE, but that flag was flawed and is a no-op now (?))

3) Is it profitable to use such page shuffling instead of a byte-by-byte copy, especially in a multicore, multithreaded world, where every change to the virtual-to-physical mapping needs to flush (invalidate) the changed page table entries from the TLBs of tens of CPU cores via IPI? (In Linux this is something like flush_tlb_range or flush_tlb_page.)

Update for Q3: some tests of mremap vs memcpy

Solution

1) Is it true that modern OSes (Linux is the most interesting to me; also FreeBSD, OSX, Windows) and their realloc implementations are really capable of reallocating pages of data by reordering the virtual-to-physical mapping, without a byte-by-byte copy?

2) What system call is used to achieve this memory move? (I think it could be splice with SPLICE_F_MOVE, but that flag was flawed and is a no-op now (?))

See thejh's answer.
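
For a concrete picture of the primitive involved, here is a minimal sketch (my own illustration, not code from thejh's answer) using Linux's mremap(2): the kernel grows an anonymous mapping and, with MREMAP_MAYMOVE, may relocate the virtual range by remapping pages rather than copying their contents.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t old_len = 16 * 4096, new_len = 1024 * 4096;

    /* An anonymous, private mapping, like a large malloc() block. */
    char *p = mmap(NULL, old_len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    memset(p, 'x', old_len);

    /* Grow the mapping: the kernel moves PTEs, not page contents.
       MREMAP_MAYMOVE allows a new virtual address if in-place growth fails. */
    char *q = mremap(p, old_len, new_len, MREMAP_MAYMOVE);
    if (q == MAP_FAILED) { perror("mremap"); return 1; }

    printf("moved: %s, data intact: %s\n",
           q == p ? "no" : "yes", q[old_len - 1] == 'x' ? "yes" : "no");
    munmap(q, new_len);
    return 0;
}

If the region cannot grow in place, the returned address differs from the old one, yet the data is still there: only the page tables were rewritten.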

Who are the actors?

You have at least three actors with your Qt example.

  1. Qt Vector class
  2. glibc's realloc()
  3. Linux's mremap

QVector::capacity() shows that Qt allocates more elements than required. This means that a typical addition of an element will not realloc() anything. The glibc allocator is based on Doug Lea's allocator. This is a binning allocator which supports the use of Linux's mremap. A binning allocator groups similar-sized allocations in bins, so a typical random-sized allocation will still have some room to grow without needing to call into the system. I.e., the free pool or slack is located at the end of the allocated memory.
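
You can observe this slack with glibc's malloc_usable_size() (glibc-specific; the exact numbers vary by allocator version and build). A small probe, assuming glibc:

#include <malloc.h>   /* malloc_usable_size(), glibc-specific */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* A random request size usually lands in a bin with room to spare,
       so modest growth is satisfied without any system call. */
    for (size_t req = 100; req <= 1000000; req *= 10) {
        void *p = malloc(req);
        printf("requested %7zu, usable %7zu\n", req, malloc_usable_size(p));
        free(p);
    }
    return 0;
}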

An answer

3) Is it profitable to use such page shuffling instead of a byte-by-byte copy, especially in a multicore, multithreaded world, where every change to the virtual-to-physical mapping needs to flush (invalidate) the changed page table entries from the TLBs of tens of CPU cores via IPI? (In Linux this is something like flush_tlb_range or flush_tlb_page.)

First off, the faster ... than mremap test mentioned above mis-uses mremap(), as R notes there.

There are several things that make mremap() valuable as a primitive for realloc().

  1. Reduced memory consumption.
  2. Preserving page mappings.
  3. Avoiding moving data.

Everything in this answer is based upon Linux's implementation, but the semantics can be transferred to other OSes.

Reduce memory consumption

Consider a naive realloc(). This is a sketch: get_sz() and resize() stand in for allocator-internal bookkeeping and are not real library functions.

void *realloc(void *ptr, size_t size)
{
    size_t old_size = get_sz(ptr);     /* from bin, address, map table, etc. */
    if (size <= old_size) {
        resize(ptr);                   /* shrink in place, allocator-internal */
        return ptr;
    }
    void *new_p = malloc(size);
    if (new_p) {
        memcpy(new_p, ptr, old_size);  /* fully committed: old_size + new size */
        free(ptr);
    }
    return new_p;
}

To support this, you may momentarily need double the memory of the realloc(): the old and new blocks are both live during the copy, which may push you into swap or simply make the reallocation fail.

Preserve page mappings

Linux will by default map new allocations to the zero page: a 4k page full of zero data. This is useful for sparsely mapped data structures. If no one writes to a data page, then no physical memory is allocated for it besides possibly a PTE table. These mappings are copy-on-write (COW). With the naive realloc(), these mappings are not preserved, and full physical memory is allocated for pages that were all zero.

If the task has been involved in a fork(), the initial realloc() data may be shared between parent and child, with COW causing physical allocation only when a page is written. The naive implementation discards this sharing and requires separate physical memory per process.

If the system is under memory pressure, the existing realloc() pages may not be in physical memory but in swap. The naive realloc will cause the swapped pages to be read from disk into memory, copied to the new location, and then likely written back out to disk.
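
To make the zero-page and COW points concrete, the following Linux-only sketch (an illustration of the above, not code from the original answer) maps a large anonymous region and samples the resident set size from /proc/self/statm before and after touching the pages. Untouched pages cost essentially no physical memory, while a copying realloc() would touch every destination page:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Resident pages of this process, read from /proc/self/statm (Linux-only). */
static long resident_pages(void)
{
    long size = 0, resident = -1;
    FILE *f = fopen("/proc/self/statm", "r");
    if (!f)
        return -1;
    if (fscanf(f, "%ld %ld", &size, &resident) != 2)
        resident = -1;
    fclose(f);
    return resident;
}

int main(void)
{
    size_t len = 256UL * 1024 * 1024;   /* 256 MiB, initially zero-page backed */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    printf("resident after mmap : %ld pages\n", resident_pages());
    memset(p, 1, len);                  /* write faults commit real pages */
    printf("resident after write: %ld pages\n", resident_pages());
    munmap(p, len);
    return 0;
}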

Avoid moving data

The issue you raise of updating TLBs is minimal compared to moving the data. A single TLB entry is typically 4 bytes and represents a 4k page of physical data. If you flush the entire TLB for a 4GB system, that is 4MB of page-table data that needs to be restored. Copying large amounts of data will blow out the L1 and L2 caches. TLB fetches naturally pipeline better than d-cache and i-cache fetches. It is rare for code to take two TLB misses in a row, since most code is sequential.

CPU caches come in two variants: VIVT (non-x86) and VIPT (as on x86). The VIVT versions usually have mechanisms to invalidate single TLB entries. For a VIPT system, the caches should not need to be invalidated, as they are physically tagged.

On a multi-core system, it is atypical for one process to run on all cores. Only the cores on which the process performing the mremap() is running need their page table entries updated. When a process is migrated to a core (a typical context switch), the page table has to be migrated anyway.

Conclusion

You can construct some pathological cases where a naive copy works better. But since Linux (like most OSes) is multi-tasking, more than one process will be running. Also, the worst case is when swapping, and the naive implementation is always worse there (unless you have a disk faster than memory). For minimal realloc() sizes, dlmalloc or QVector should have fallow space to avoid a system-level mremap(). A typical mremap() may just extend the virtual address range by growing the region with a random page from the free pool. It is only when the virtual address range must move that mremap() may need a TLB flush, and then only with all of the following being true:

  1. The realloc() memory should not be shared with a parent or child process.
  2. The memory should not be sparse (mostly zero or untouched).
  3. The system should not be under memory pressure that is forcing use of swap.

The TLB flush and IPI are needed only if the same process is currently running on other cores. L1-cache loading is not needed for mremap(), but it is for the naive version. The L2 is usually shared between cores and will be current in all cases; the naive version, however, forces the L2 to reload. mremap() might leave some unused data out of the L2 cache; this is normally a good thing, but could be a disadvantage under some workloads. There are probably better ways to do this, such as pre-fetching the data.
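
If you want to measure the trade-off yourself, a rough harness along the following lines compares the two strategies. It is only a sketch: timings vary greatly with size, kernel, and CPU, and all pages are deliberately committed up front so that memcpy is not charged for page faults.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    size_t old_len = 64UL << 20, new_len = 128UL << 20;  /* 64 MiB -> 128 MiB */

    /* Strategy 1: allocate-and-copy, as the naive realloc() does. */
    char *src = mmap(NULL, old_len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    char *dst = mmap(NULL, new_len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (src == MAP_FAILED || dst == MAP_FAILED) { perror("mmap"); return 1; }
    memset(src, 1, old_len);           /* commit source pages up front */
    memset(dst, 1, new_len);           /* commit destination pages too */
    double t0 = now_sec();
    memcpy(dst, src, old_len);
    double t1 = now_sec();

    /* Strategy 2: let the kernel remap the pages. */
    char *p = mmap(NULL, old_len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    memset(p, 1, old_len);
    double t2 = now_sec();
    char *q = mremap(p, old_len, new_len, MREMAP_MAYMOVE);
    double t3 = now_sec();
    if (q == MAP_FAILED) { perror("mremap"); return 1; }

    printf("memcpy: %.3f ms, mremap: %.3f ms\n",
           (t1 - t0) * 1e3, (t3 - t2) * 1e3);
    munmap(src, old_len); munmap(dst, new_len); munmap(q, new_len);
    return 0;
}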
