Can I ask the kernel to populate (fault in) a range of anonymous pages?
Problem description
In Linux, using C, if I ask for a large amount of memory via malloc or a similar dynamic allocation mechanism, it is likely that most of the pages backing the returned region won't actually be mapped into the address space of my process.
Instead, a page fault is incurred each time I access one of the allocated pages for the first time; the kernel then maps in an "anonymous" page (consisting entirely of zeros) and returns to user space.
For a large region (say 1 GiB) this is a large number of page faults (~260 thousand for 4 KiB pages), and each fault incurs a user-to-kernel-to-user transition, which is especially slow on kernels with Spectre and Meltdown mitigations. For some uses, this page-faulting time might dominate the actual work being done on the buffer.
If I know I'm going to use the entire buffer, is there some way to ask the kernel to populate the already-mapped region ahead of time?
If I was allocating my own memory using mmap, the way to do this would be MAP_POPULATE - but that doesn't work for regions received from malloc or new.
There is the madvise call, but the options there seem mostly to apply to file-backed regions. For example, the madvise(..., MADV_WILLNEED) call seems promising - from the man page:
MADV_WILLNEED
Expect access in the near future. (Hence, it might be a good idea to read some pages ahead.)
The obvious implication is that if the region is file-backed, this call might trigger an asynchronous file read-ahead, or perhaps a synchronous additional read-ahead on subsequent faults. From the description, it isn't clear whether it will do anything for anonymous pages, and based on my testing, it doesn't.
Answer
It's a bit of a dirty hack, and works best for privileged processes or on systems with a high RLIMIT_MEMLOCK, but... an mlock and munlock pair will achieve the effect you are looking for.
For example, given the following test program:
// compile with (e.g.): cc -O1 -Wall pagefaults.c -o pagefaults
#include <stdlib.h>
#include <stdio.h>
#include <err.h>
#include <sys/mman.h>

#define DEFAULT_SIZE (40 * 1024 * 1024)
#define PG_SIZE 4096

void failcheck(int ret, const char* what) {
    if (ret) {
        err(EXIT_FAILURE, "%s failed", what);
    } else {
        printf("%s OK\n", what);
    }
}

int main(int argc, char **argv) {
    size_t size = (argc == 2 ? atol(argv[1]) : DEFAULT_SIZE);
    char *mem = malloc(size);
    if (getenv("DO_MADVISE")) {
        failcheck(madvise(mem, size, MADV_WILLNEED), "madvise");
    }
    if (getenv("DO_MLOCK")) {
        failcheck(mlock(mem, size), "mlock");
        failcheck(munlock(mem, size), "munlock");
    }
    for (volatile char *p = mem; p < mem + size; p += PG_SIZE) {
        *p = 'z';
    }
    printf("size: %6.2f MiB, pages touched: %zu\npointer value : %p\n",
           size / 1024. / 1024., size / PG_SIZE, mem);
}
Running it as root for a 1 GB region and counting page faults with perf results in:
$ perf stat ./pagefaults 1000000000
size: 953.67 MiB, pages touched: 244140
pointer value : 0x7f2fc2584010
Performance counter stats for './pagefaults 1000000000':
352.474676 task-clock (msec) # 0.999 CPUs utilized
2 context-switches # 0.006 K/sec
0 cpu-migrations # 0.000 K/sec
244,189 page-faults # 0.693 M/sec
914,276,474 cycles # 2.594 GHz
703,359,688 instructions # 0.77 insn per cycle
117,710,381 branches # 333.954 M/sec
447,022 branch-misses # 0.38% of all branches
0.352814087 seconds time elapsed
However, if you run prefixed with DO_MLOCK=1, you get:
$ sudo DO_MLOCK=1 perf stat ./pagefaults 1000000000
mlock OK
munlock OK
size: 953.67 MiB, pages touched: 244140
pointer value : 0x7f8047f6b010
Performance counter stats for './pagefaults 1000000000':
240.236189 task-clock (msec) # 0.999 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
49 page-faults # 0.204 K/sec
623,152,764 cycles # 2.594 GHz
959,640,219 instructions # 1.54 insn per cycle
150,713,144 branches # 627.354 M/sec
484,400 branch-misses # 0.32% of all branches
0.240538327 seconds time elapsed
Note that the number of page faults has dropped from 244,189 to 49, and there is a 1.46x speedup. The overwhelming majority of the time is still spent in the kernel, so this could probably be a lot faster if it wasn't necessary to invoke both mlock and munlock, and perhaps also because the semantics of mlock are more than is required.
For non-privileged processes, you'll probably hit RLIMIT_MEMLOCK if you try to do a large region all at once (on my Ubuntu system it's set at 64 KiB), but you could loop over the region calling mlock(); munlock() on smaller chunks.