Why is malloc+memset slower than calloc?


Question


It's known that calloc is different from malloc in that it initializes the memory allocated. With calloc, the memory is set to zero. With malloc, the memory is not cleared.

So in everyday work, I regard calloc as malloc+memset. Incidentally, for fun, I wrote the following code as a benchmark.

The result is confusing.

Code 1:

#include<stdio.h>
#include<stdlib.h>
#define BLOCK_SIZE 1024*1024*256
int main()
{
        int i=0;
        char *buf[10];
        while(i<10)
        {
                buf[i] = (char*)calloc(1,BLOCK_SIZE);
                i++;
        }
}

Output of Code 1:

time ./a.out
real    0m0.287s
user    0m0.095s
sys     0m0.192s

Code 2:

#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#define BLOCK_SIZE 1024*1024*256
int main()
{
        int i=0;
        char *buf[10];
        while(i<10)
        {
                buf[i] = (char*)malloc(BLOCK_SIZE);
                memset(buf[i],'\0',BLOCK_SIZE);
                i++;
        }
}

Output of Code 2:

time ./a.out
real    0m2.693s
user    0m0.973s
sys     0m1.721s

Replacing memset with bzero(buf[i],BLOCK_SIZE) in Code 2 produces the same result.

My question is: Why is malloc+memset so much slower than calloc? How can calloc do that?

Solution

The short version: Always use calloc() instead of malloc()+memset(). In most cases, they will be the same. In some cases, calloc() will do less work because it can skip memset() entirely. In other cases, calloc() can even cheat and not allocate any memory! However, malloc()+memset() will always do the full amount of work.

Understanding this requires a short tour of the memory system.

Quick tour of memory

There are four main parts here: your program, the standard library, the kernel, and the page tables. You already know your program, so...

Memory allocators like malloc() and calloc() are mostly there to take small allocations (anything from 1 byte to 100s of KB) and group them into larger pools of memory. For example, if you allocate 16 bytes, malloc() will first try to get 16 bytes out of one of its pools, and then ask for more memory from the kernel when the pool runs dry. However, since the program you're asking about allocates a large amount of memory at once, malloc() and calloc() will just ask for that memory directly from the kernel. The threshold for this behavior depends on your system, but I've seen 1 MiB used as the threshold.

The kernel is responsible for allocating actual RAM to each process and making sure that processes don't interfere with other processes' memory. This is called memory protection; it has been dirt common since the 1990s, and it's the reason one program can crash without bringing down the whole system. So when a program needs more memory, it can't just take it; instead, it asks the kernel for memory using a system call like mmap() or sbrk(). The kernel gives RAM to each process by modifying the page table.

The page table maps memory addresses to actual physical RAM. Your process's addresses, 0x00000000 to 0xFFFFFFFF on a 32-bit system, aren't real memory but instead are addresses in virtual memory. The processor divides these addresses into 4 KiB pages, and each page can be assigned to a different piece of physical RAM by modifying the page table. Only the kernel is permitted to modify the page table.

How it doesn't work

Here's how allocating 256 MiB does not work:

  1. Your process calls calloc() and asks for 256 MiB.

  2. The standard library calls mmap() and asks for 256 MiB.

  3. The kernel finds 256 MiB of unused RAM and gives it to your process by modifying the page table.

  4. The standard library zeroes the RAM with memset() and returns from calloc().

  5. Your process eventually exits, and the kernel reclaims the RAM so it can be used by another process.

How it actually works

The above process would work, but it just doesn't happen this way. There are three major differences.

  • When your process gets new memory from the kernel, that memory was probably used by some other process previously. This is a security risk. What if that memory has passwords, encryption keys, or secret salsa recipes? To keep sensitive data from leaking, the kernel always scrubs memory before giving it to a process. We might as well scrub the memory by zeroing it, and if new memory is zeroed we might as well make it a guarantee, so mmap() guarantees that the new memory it returns is always zeroed.

  • There are a lot of programs out there that allocate memory but don't use the memory right away. Sometimes memory is allocated but never used. The kernel knows this and is lazy. When you allocate new memory, the kernel doesn't touch the page table at all and doesn't give any RAM to your process. Instead, it finds some address space in your process, makes a note of what is supposed to go there, and promises to put RAM there if your program ever actually uses it. When your program tries to read or write those addresses, the processor triggers a page fault, and the kernel steps in, assigns RAM to those addresses, and resumes your program. If you never use the memory, the page fault never happens and your program never actually gets the RAM.

  • Some processes allocate memory and then read from it without modifying it. This means that a lot of pages in memory across different processes may be filled with pristine zeroes returned from mmap(). Since these pages are all the same, the kernel makes all these virtual addresses point to a single shared 4 KiB page of memory filled with zeroes. If you try to write to that memory, the processor triggers another page fault and the kernel steps in to give you a fresh page of zeroes that isn't shared with any other program.

The final process looks more like this:

  1. Your process calls calloc() and asks for 256 MiB.

  2. The standard library calls mmap() and asks for 256 MiB.

  3. The kernel finds 256 MiB of unused address space, makes a note about what that address space is now used for, and returns.

  4. The standard library knows that the result of mmap() is always filled with zeroes (or will be once it actually gets some RAM), so it doesn't touch the memory, so there is no page fault, and the RAM is never given to your process.

  5. Your process eventually exits, and the kernel doesn't need to reclaim the RAM because it was never allocated in the first place.

If you use memset() to zero the page, memset() will trigger the page fault, cause the RAM to be allocated, and then zero it even though it is already filled with zeroes. This is an enormous amount of extra work, and it explains why calloc() is faster than malloc()+memset(). If you end up using the memory anyway, calloc() is still faster than malloc()+memset(), but the difference is not quite so ridiculous.


This doesn't always work

Not all systems have paged virtual memory, so not all systems can use these optimizations. This applies to very old processors like the 80286 as well as embedded processors which are just too small for a sophisticated memory management unit.

This also won't always work with smaller allocations. With smaller allocations, calloc() gets memory from a shared pool instead of going directly to the kernel. That pool may contain junk data left over from chunks that were used and then released with free(), so calloc() may have to take that memory and call memset() to clear it out. Common implementations track which parts of the pool are still pristine and filled with zeroes, but not all implementations do this.

Dispelling some wrong answers

Depending on the operating system, the kernel may or may not zero memory in its free time, in case you need to get some zeroed memory later. Linux does not zero memory ahead of time, and DragonFly BSD recently removed this feature from its kernel. Some other kernels do zero memory ahead of time, however. Zeroing pages during idle isn't enough to explain the large performance difference anyway.

The calloc() function is not using some special memory-aligned version of memset(), and that wouldn't make it much faster anyway. Most memset() implementations for modern processors look kind of like this:

function memset(dest, c, len)
    // one byte at a time, until the dest is aligned...
    while (len > 0 && ((unsigned int)dest & 15))
        *dest++ = c
        len -= 1
    // now write big chunks at a time (processor-specific)...
    // block size might not be 16, it's just pseudocode
    while (len >= 16)
        // some optimized vector code goes here
        // glibc uses SSE2 when available
        dest += 16
        len -= 16
    // the end is not aligned, so one byte at a time
    while (len > 0)
        *dest++ = c
        len -= 1

So you can see, memset() is very fast and you're not really going to get anything better for large blocks of memory.
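
For readers who want something compilable, here is the same three-phase structure in real C, with scalar 8-byte stores standing in for the vector code. This is an illustrative sketch, not glibc's actual implementation (which uses SSE2/AVX stores and other tricks):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Byte-fill until the pointer is 8-byte aligned, then store 8 bytes at
   a time, then finish the unaligned tail -- the shape of the
   pseudocode above with a plain 64-bit word as the "big chunk". */
void *my_memset(void *dest, int c, size_t len) {
    unsigned char *d = dest;
    unsigned char byte = (unsigned char)c;

    while (len > 0 && ((uintptr_t)d & 7)) {  /* head: align to 8 */
        *d++ = byte;
        len--;
    }

    uint64_t word = 0x0101010101010101ull * byte;  /* byte repeated x8 */
    while (len >= 8) {                       /* body: 8-byte stores */
        memcpy(d, &word, 8);                 /* avoids strict-aliasing UB */
        d += 8;
        len -= 8;
    }

    while (len > 0) {                        /* tail: one byte at a time */
        *d++ = byte;
        len--;
    }
    return dest;
}
```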

The fact that memset() is zeroing memory that is already zeroed does mean that the memory gets zeroed twice, but that only explains a 2x performance difference. The performance difference here is much larger (I measured more than three orders of magnitude on my system between malloc()+memset() and calloc()).

Party trick

Instead of looping 10 times, write a program that allocates memory until malloc() or calloc() returns NULL.

What happens if you add memset()?
