Can I ask the kernel to populate (fault in) a range of anonymous pages?

Question

In Linux, using C, if I ask for a large amount of memory via malloc or a similar dynamic allocation mechanism, it is likely that most of the pages backing the returned region won't actually be mapped into the address space of my process.

Instead, a page fault is incurred each time I access one of the allocated pages for the first time; the kernel then maps in an "anonymous" page (consisting entirely of zeros) and returns to user space.

For a large region (say 1 GiB) this is a large number of page faults (about 262 thousand for 4 KiB pages), and each fault incurs a user-to-kernel-to-user transition, which is especially slow on kernels with Spectre and Meltdown mitigations. For some uses, this page-faulting time might dominate the actual work being done on the buffer.

If I know I'm going to use the entire buffer, is there some way to ask the kernel to populate the pages of an already-allocated region ahead of time?

If I were allocating my own memory using mmap, the way to do this would be MAP_POPULATE - but that doesn't work for regions received from malloc or new.
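For reference, the mmap route mentioned above might look like the following. This is only a minimal sketch, assuming Linux and a private anonymous mapping; the helper names alloc_populated and free_populated are hypothetical and not part of the original post.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes MAP_ANONYMOUS and MAP_POPULATE */
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: map `size` bytes of zeroed anonymous memory and ask
 * the kernel to prefault the pages at mmap time (MAP_POPULATE).
 * Returns NULL on failure. */
void *alloc_populated(size_t size) {
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Release a region obtained from alloc_populated(); returns 0 on success. */
int free_populated(void *p, size_t size) {
    return munmap(p, size);
}
```

A caller would use alloc_populated(size) in place of malloc(size) and free_populated(ptr, size) in place of free(ptr). That is exactly why it does not help here: the region has to be created by this call, so it cannot be applied to a pointer that malloc or new already returned.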

There is the madvise call, but the options there seem mostly to apply to file-backed regions. For example, the madvise(..., MADV_WILLNEED) call seems promising - from the man page:

MADV_WILLNEED

Expect access in the near future. (Hence, it might be a good idea to read some pages ahead.)

The obvious implication is if the region is file-backed, this call might trigger an asynchronous file read-ahead, or perhaps a synchronous additional read-ahead on subsequent faults. From the description, it isn't clear if it will do anything for anonymous pages, and based on my testing, it doesn't.
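One way to run this kind of test from inside the program is to read the process's minor-fault counter before and after touching the buffer. The following is a rough sketch, assuming Linux reports minor faults in the ru_minflt field of getrusage; the helper names minor_faults and faults_while_touching are hypothetical and not from the original post.

```c
#include <stddef.h>
#include <sys/resource.h>

/* Hypothetical helper: minor page faults taken by this process so far,
 * or -1 if getrusage fails. */
long minor_faults(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) {
        return -1;
    }
    return ru.ru_minflt;
}

/* Write one byte per `page` bytes of [mem, mem + size) and return how many
 * minor faults were taken while doing so, or -1 on error. */
long faults_while_touching(volatile char *mem, size_t size, size_t page) {
    long before = minor_faults();
    for (size_t off = 0; off < size; off += page) {
        mem[off] = 'z';
    }
    long after = minor_faults();
    return (before < 0 || after < 0) ? -1 : after - before;
}
```

If a call such as madvise(..., MADV_WILLNEED) had actually populated the region, faults_while_touching on that region would be expected to return a number near zero instead of roughly one fault per page.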

Recommended answer

It's a bit of a dirty hack, and it works best for privileged processes or on systems with a high RLIMIT_MEMLOCK, but... an mlock and munlock pair will achieve the effect you are looking for.

For example, given the following test program:

// compile with, e.g.: cc -O1 -Wall pagefaults.c -o pagefaults

#include <stdlib.h>
#include <stdio.h>
#include <err.h>
#include <sys/mman.h>

#define DEFAULT_SIZE        (40 * 1024 * 1024)
#define PG_SIZE     4096

void failcheck(int ret, const char* what) {
    if (ret) {
        err(EXIT_FAILURE, "%s failed", what);
    } else {
        printf("%s OK\n", what);
    }
}

int main(int argc, char **argv) {
    size_t size = (argc == 2 ? (size_t)atol(argv[1]) : DEFAULT_SIZE);
    char *mem = malloc(size);
    if (!mem) {
        err(EXIT_FAILURE, "malloc failed");
    }

    if (getenv("DO_MADVISE")) {
        failcheck(madvise(mem, size, MADV_WILLNEED), "madvise");
    }

    if (getenv("DO_MLOCK")) {
        failcheck(mlock(mem, size), "mlock");
        failcheck(munlock(mem, size), "munlock");
    }

    for (volatile char *p = mem; p < mem + size; p += PG_SIZE) {
        *p = 'z';
    }
    printf("size: %6.2f MiB, pages touched: %zu\npoitner value : %p\n",
            size / 1024. / 1024., size / PG_SIZE, mem);
}

Running it as root for a 1 GB region and counting pagefaults with perf results in:

$ perf stat ./pagefaults 1000000000
size: 953.67 MiB, pages touched: 244140
poitner value : 0x7f2fc2584010

 Performance counter stats for './pagefaults 1000000000':

        352.474676      task-clock (msec)         #    0.999 CPUs utilized          
                 2      context-switches          #    0.006 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
           244,189      page-faults               #    0.693 M/sec                  
       914,276,474      cycles                    #    2.594 GHz                    
       703,359,688      instructions              #    0.77  insn per cycle         
       117,710,381      branches                  #  333.954 M/sec                  
           447,022      branch-misses             #    0.38% of all branches        

       0.352814087 seconds time elapsed

However, if you run prefixed with DO_MLOCK=1, you get:

$ sudo DO_MLOCK=1 perf stat ./pagefaults 1000000000
mlock OK
munlock OK
size: 953.67 MiB, pages touched: 244140
poitner value : 0x7f8047f6b010

 Performance counter stats for './pagefaults 1000000000':

        240.236189      task-clock (msec)         #    0.999 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                49      page-faults               #    0.204 K/sec                  
       623,152,764      cycles                    #    2.594 GHz                    
       959,640,219      instructions              #    1.54  insn per cycle         
       150,713,144      branches                  #  627.354 M/sec                  
           484,400      branch-misses             #    0.32% of all branches        

       0.240538327 seconds time elapsed

Note that the number of page faults has dropped from 244,189 to 49, and there is a 1.46x speedup. The overwhelming majority of the time is still spent in the kernel, so this could probably be a lot faster if it weren't necessary to invoke both mlock and munlock, and possibly also because the semantics of mlock are stronger than is required.

For non-privileged processes, you'll probably hit RLIMIT_MEMLOCK if you try to do a large region all at once (on my Ubuntu system it's set to 64 KiB), but you could loop over the region, calling mlock(); munlock() on one small chunk at a time.
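The chunked loop described above could be sketched roughly as follows. The function name prefault_chunked and its chunk parameter are hypothetical, introduced only for illustration. The sketch assumes the caller picks a chunk size comfortably below RLIMIT_MEMLOCK: mlock rounds the start address down to a page boundary, so an unaligned chunk can cover one page more than its length suggests.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: fault in [mem, mem + size) by locking and then
 * unlocking it `chunk` bytes at a time, so that no more than roughly one
 * chunk is ever locked at once.
 * Returns 0 on success, -1 with errno set on failure. */
int prefault_chunked(void *mem, size_t size, size_t chunk) {
    char *p = mem;

    if (chunk == 0) {
        errno = EINVAL;
        return -1;
    }
    while (size > 0) {
        size_t n = size < chunk ? size : chunk;
        if (mlock(p, n) != 0) {
            return -1;          /* e.g. ENOMEM when over RLIMIT_MEMLOCK */
        }
        if (munlock(p, n) != 0) {
            return -1;
        }
        p += n;
        size -= n;
    }
    return 0;
}
```

With the 64 KiB limit mentioned above, a call such as prefault_chunked(mem, size, 32 * 1024) leaves headroom for the page rounding. Smaller chunks mean more system calls, so whether this still beats taking the page faults directly would need to be measured.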
