Why don't memory allocators actively return freed memory to the OS?


Question



Yes, this might be the third time you see this code, because I asked two other questions about it (this and this). The code is fairly simple:

#include <vector>
int main() {
    std::vector<int> v;
}

Then I build and run it with Valgrind on Linux:

g++ test.cc && valgrind ./a.out
==8511== Memcheck, a memory error detector
...
==8511== HEAP SUMMARY:
==8511==     in use at exit: 72,704 bytes in 1 blocks
==8511==   total heap usage: 1 allocs, 0 frees, 72,704 bytes allocated
==8511==
==8511== LEAK SUMMARY:
==8511==    definitely lost: 0 bytes in 0 blocks
==8511==    indirectly lost: 0 bytes in 0 blocks
==8511==      possibly lost: 0 bytes in 0 blocks
==8511==    still reachable: 72,704 bytes in 1 blocks
==8511==         suppressed: 0 bytes in 0 blocks
...
==8511== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Here, Valgrind reports no memory leak, even though there was 1 allocation and 0 frees.

The answer here points out that the allocator used by the C++ standard library doesn't necessarily return the memory to the OS - it might keep it in an internal cache.

Question is:

1) Why keep them in an internal cache? If it is for speed, how is it faster? Yes, the OS needs to maintain a data structure to keep track of memory allocations, but the maintainer of this cache also needs to do so.

2) How is this implemented? Because my program a.out has already terminated, there is no other process that is maintaining this memory cache - or is there one?

Edit: for question (2), some answers I've seen suggest the "C++ runtime" - what does that mean? If the "C++ runtime" is the C++ library, that library is just a bunch of machine code sitting on disk; it is not a running process. The machine code is either linked into my a.out (static library, .a) or invoked at runtime (shared object, .so) within the a.out process.

Solution

Clarification

First, some clarification. You asked: "... my program a.out has already terminated, there is no other process that is maintaining this memory cache - or is there one?"

Everything we are talking about is within the lifetime of a single process: the process always returns all allocated memory when it exits. There is no cache that outlives the process1. The memory is returned even without any help from the runtime allocator: the OS simply "takes it back" when the process is terminated. So there is no system-wide leak possible from terminated applications with normal allocations.

Now what Valgrind is reporting is memory that is in use at the moment the process terminated, but before the OS cleans everything up. It works at the runtime library level, and not at the OS level. So it's saying "Hey, when the program finished, there were 72,000 bytes that hadn't been returned to the runtime" but an unstated implication is that "these allocations will be cleaned up shortly by the OS".

The Underlying Questions

The code and Valgrind output shown don't really correlate well with the titular question, so let's break them apart. First we'll just try to answer the questions you asked about allocators: why they exist and why they generally don't immediately return freed memory to the OS, ignoring the example.

You asked:

1) Why keep them in an internal cache? If it is for speed, how is it faster? Yes, the OS needs to maintain a data structure to keep track of memory allocations, but the maintainer of this cache also needs to do so.

This is sort of two questions in one: one is why bother having a userland runtime allocator at all, and the other is why these allocators don't immediately return memory to the OS when it is freed. They are related, but let's tackle them one at a time.

Why Runtime Allocators Exist

Why not just rely on the OS memory allocation routines?

  • Many operating systems, including most Linux and other Unix-like operating systems, simply don't have an OS system call to allocate and free arbitrary blocks of memory. Unix-alikes offer brk which only grows or shrinks one contiguous block of memory - you have no way to "free" arbitrary earlier allocations. They also offer mmap which allows you to independently allocate and free chunks of memory, but these allocate on a PAGE_SIZE granularity, which on Linux is 4096 bytes. So if you want to request 32 bytes, you'll have to waste 4096 - 32 == 4064 bytes if you don't have your own allocator. On these operating systems you practically need a separate memory allocation runtime which turns these coarse-grained tools into something capable of efficiently allocating small blocks.

    Windows is a bit different. It has the HeapAlloc call, which is part of the "OS" and does offer malloc-like capabilities of allocating and freeing arbitrarily sized chunks of memory. With some compilers then, malloc is just implemented as a thin wrapper around HeapAlloc (the performance of this call has improved greatly in recent Windows versions, making this feasible). Still, while HeapAlloc is part of the OS it isn't implemented in the kernel - it is also mostly implemented in a user-mode library, managing a list of free and used blocks, with occasional kernel calls to get chunks of memory from the kernel. So it is mostly malloc in another disguise and any memory it is holding on to is also not available to any other processes.

  • Performance! Even if there were appropriate kernel-level calls to allocate arbitrary blocks of memory, the simple overhead of a roundtrip to the kernel is usually hundreds of nanoseconds or more. A well-tuned malloc allocation or free, on the other hand, is often only a dozen instructions and may complete in 10 ns or less. On top of that, system calls can't "trust their input" and so must carefully validate parameters passed from user-space. In the case of free this means that it must check that the user passed a pointer which is valid! Most runtime free implementations simply crash or silently corrupt memory when passed an invalid pointer, since there is no responsibility to protect a process from itself.
  • Closer link to the rest of the language runtime. The functions you use to allocate memory in C++, namely new, malloc and friends, are part of an interface defined by the language. It is then entirely natural to implement them as part of the runtime that implements the rest of the language, rather than the OS, which is for the most part language-agnostic. For example, the language may have specific alignment requirements for various objects, which can best be handled by language-aware allocators. Changes to the language or compiler might also imply necessary changes to the allocation routines, and it would be a tough call to hope for the kernel to be updated to accommodate your language features!

Why Not Return Memory to the OS

Your example doesn't show it, but if you wrote a different test you would probably find that after allocating and then freeing a bunch of memory, your process's resident set size and/or virtual size, as reported by the OS, might not decrease after the free. That is, it seems like the process holds on to the memory even though you have freed it. This is in fact true of many malloc implementations. First, note that this is not a leak per se - the unreturned memory is still available to the process that allocated it, even if not to other processes.

Why do they do that? Here are some reasons:

  1. The kernel API makes it hard. For the old-school brk and sbrk system calls, it simply isn't feasible to return freed memory unless it happens to be at the end of the very last block allocated from brk or sbrk. That's because the abstraction offered by these calls is a single large contiguous region that you can only extend from one end. You can't hand back memory from the middle of it. Rather than trying to support the unusual case where all the freed memory happens to be at the end of the brk region, most allocators don't even bother.

    The mmap call is more flexible (and this discussion generally applies also to Windows where VirtualAlloc is the mmap equivalent), allowing you to at least return memory at a page granularity - but even that is hard! You can't return a page until all allocations that are part of that page are freed. Depending on the size and allocation/free pattern of the application that may be common or uncommon. A case where it works well is for large allocations - greater than a page. Here you're guaranteed to be able to free most of the allocation if it was done via mmap and indeed some modern allocators satisfy large allocations directly from mmap and free them back to the OS with munmap. For glibc (and by extension the C++ allocation operators), you can even control this threshold:

    M_MMAP_THRESHOLD
      For allocations greater than or equal to the limit specified
      (in bytes) by M_MMAP_THRESHOLD that can't be satisfied from
      the free list, the memory-allocation functions employ mmap(2)
      instead of increasing the program break using sbrk(2).
    
      Allocating memory using mmap(2) has the significant advantage
      that the allocated memory blocks can always be independently
      released back to the system.  (By contrast, the heap can be
      trimmed only if memory is freed at the top end.)  On the other
      hand, there are some disadvantages to the use of mmap(2):
      deallocated space is not placed on the free list for reuse by
      later allocations; memory may be wasted because mmap(2)
      allocations must be page-aligned; and the kernel must perform
      the expensive task of zeroing out memory allocated via
      mmap(2).  Balancing these factors leads to a default setting
      of 128*1024 for the M_MMAP_THRESHOLD parameter.
    

    So by default, allocations of 128K or more will be allocated by the runtime directly from the OS and handed back to the OS on free. So for large allocations you will sometimes see the behavior you might have expected to always be the case.

  2. Performance! Every kernel call is expensive, as described in the list above. Memory that is freed by a process will usually be needed soon afterwards to satisfy another allocation. Rather than trying to return it to the OS, a relatively heavyweight operation, why not just keep it around on a free list to satisfy future allocations? As pointed out in the man page entry, this also avoids the overhead of zeroing out all the memory returned by the kernel. It also gives the best chance of good cache behavior, since the process is continually re-using the same region of the address space. Finally, it avoids the TLB flushes which would be imposed by munmap (and possibly by shrinking via brk).
  3. The "problem" of not returning memory is the worst for long-lived processes that allocate a bunch of memory at some point, free it and then never allocate that much again. I.e., processes whose allocation high-water mark is larger than their long term typical allocation amount. Most processes just don't follow that pattern, however. Processes often free a lot of memory, but allocate at a rate such that their overall memory use is constant or perhaps increasing. Applications that do have the "big then small" live size pattern could perhaps force the issue with malloc_trim.
  4. Virtual memory helps mitigate the issue. So far I've been throwing around terms like "allocated memory" without really defining what it means. If a program allocates and then frees 2 GB of memory and then sits around doing nothing, is it wasting 2 GB of actual DRAM plugged into your motherboard somewhere? Probably not. It is using 2 GB of virtual address space in your process, sure, but virtual address space is per-process, so that doesn't directly take anything away from other processes. If the process actually wrote to the memory at some point, it would be allocated physical memory (yes, DRAM) - after freeing it, you are - by definition - no longer using it. At that point the OS may reclaim those physical pages for use by someone else.

    Now this still requires you to have swap to absorb the dirty not-used pages, but some allocators are smart: they can issue a madvise(..., MADV_DONTNEED) call which tells the OS "this range doesn't have anything useful, you don't have to preserve its contents in swap". It still leaves the virtual address space mapped in the process and usable later (zero filled), so it's more efficient than munmap and a subsequent mmap, but it avoids pointlessly writing freed memory regions out to swap.2

The Demonstrated Code

As pointed out in this answer, your test with vector<int> isn't really testing anything, because an empty, unused std::vector<int> v won't even create the vector object as long as you are using some minimal level of optimization. Even without optimization, no allocation is likely to occur, because most vector implementations allocate on first insertion, not in the constructor. Finally, even if you are using some unusual compiler or library that does an allocation, it will be for a handful of bytes, not the ~72,000 bytes Valgrind is reporting.

You should do something like this to actually see the impact of a vector allocation:

#include <vector>

volatile std::vector<int> *sink;

int main() {
    std::vector<int> v(12345678);
    sink = &v;
}

That results in actual allocation and de-allocation. It isn't going to change the Valgrind output, however, since the vector allocation is correctly freed before the program exits, so there is no issue as far as Valgrind is concerned.

At a high level, Valgrind basically categorizes things into "definite leaks" and "not freed at exit". The former occur when the program no longer has a reference to a pointer to memory that it allocated. It cannot free such memory and so has leaked it. Memory which hasn't been freed at exit may be a "leak" - i.e., objects that should have been freed, but it may also simply be memory that the developer knew would live the length of the program and so doesn't need to be explicitly freed (because of order-of-destruction issues for globals, especially when shared libraries are involved, it may be very hard to reliably free memory associated with global or static objects even if you wanted to).


1 In some cases some deliberately special allocations may outlive the process, such as shared memory and memory mapped files, but that doesn't relate to plain C++ allocations and you can ignore it for the purposes of this discussion.

2 Recent Linux kernels also have the Linux-specific MADV_FREE which seems to have similar semantics to MADV_DONTNEED.
