Linux mremap without freeing the old mapping?

Question

I need a way to copy pages from one virtual address range to another without actually copying the data. The ranges are massive and latency is important. mremap can do this, but the problem is that it also deletes the old mapping. Since I need to do this in a multithreaded environment, I need the old mapping to remain usable at the same time; I will free it later, once I am certain no other thread can be using it. Is this possible, however hacky, without modifying the kernel? The solution only needs to work with recent Linux kernels.

Recommended answer

It is possible, although there are architecture-specific cache consistency issues you may need to consider. Some architectures simply do not allow the same page to be accessed from multiple virtual addresses simultaneously without losing coherency. So, some architectures will manage this fine, others do not.

Edited to add: AMD64 Architecture Programmer's Manual vol. 2, System Programming, section 7.8.7 Changing Memory Type, states:

A physical page should not have differing cacheability types assigned to it through different virtual mappings; they should be either all of a cacheable type (WB, WT, WP) or all of a non-cacheable type (UC, WC, CD). Otherwise, this may result in a loss of cache coherency, leading to stale data and unpredictable behavior.

Thus, on AMD64, it should be safe to mmap() the same file or shared memory region again, as long as the same prot and flags are used; that should cause the kernel to use the same cacheability type for each of the mappings.

The first step is to always use a file backing for the memory maps. Use mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE, fd, 0) so that the mappings do not reserve swap. (If you forget this, you'll run into swap limits much sooner than you hit actual real-life limits for many workloads.) The extra overhead caused by having a file backing is absolutely negligible.
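
As a minimal sketch of that first step, the following C fragment creates a scratch backing file and maps it with the flags quoted above. The helper names open_backing_file() and map_backing_file() and the /tmp template path are assumptions made for illustration; a file on a dedicated tmpfs mount would serve equally well.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: create an unlinked scratch file of `length` bytes.
 * The template path is an assumption; any writable directory would do. */
int open_backing_file(size_t length)
{
    char path[] = "/tmp/remap-demo-XXXXXX";
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);                 /* the file lives on until fd is closed */
    if (ftruncate(fd, (off_t)length) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Map `length` bytes of fd, shared and without reserving swap.
 * Returns MAP_FAILED on error, exactly like mmap(). */
void *map_backing_file(int fd, size_t length)
{
    assert(length > 0);
    return mmap(NULL, length, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_NORESERVE, fd, 0);
}
```

Every later mapping of the same descriptor sees the same pages, which is what makes the rest of the approach possible.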

Edited to add: User strcmp pointed out that current kernels do not apply address space randomization to the addresses. Fortunately, this is easy to fix by supplying randomly generated addresses to mmap() instead of NULL. On x86-64, the user address space is 47-bit, and the address should be page aligned; you could use e.g. Xorshift* to generate the addresses, then mask out the unwanted bits: & 0x00007FFFFFE00000 would give 2097152-byte-aligned 47-bit addresses, for example.
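
A rough sketch of such a hint generator, assuming x86-64 and the common xorshift64* variant (shifts 12, 25, 27 and multiplier 0x2545F4914F6CDD1D). The mask used here is 0x00007FFFFFE00000, which keeps bits 21 through 46, so every result is a 2 MiB-aligned address below 2^47. The names HINT_MASK and random_hint() are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* 47-bit user address space with the low 21 bits clear (2 MiB alignment). */
#define HINT_MASK UINT64_C(0x00007FFFFFE00000)

/* One step of the xorshift64* generator; *state must never be zero. */
uint64_t xorshift64star(uint64_t *state)
{
    uint64_t x = *state;
    assert(x != 0);
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    *state = x;
    return x * UINT64_C(0x2545F4914F6CDD1D);
}

/* Hypothetical helper: a random, 2 MiB-aligned address hint for mmap(). */
void *random_hint(uint64_t *state)
{
    uint64_t addr;
    do {
        addr = xorshift64star(state) & HINT_MASK;
    } while (addr == 0);   /* a NULL hint would let the kernel choose */
    return (void *)(uintptr_t)addr;
}
```

Passed as the first argument to mmap() without MAP_FIXED, the value is only a hint: if the range is occupied, the kernel picks a different address rather than failing. The seed should come from a real entropy source if unpredictability matters.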

Because the backing is a file, you can create a second mapping to the same file after enlarging the backing file using ftruncate(). Only after a suitable grace period, when you know no thread is using the old mapping anymore (perhaps use an atomic counter to keep track of that?), do you unmap the original mapping.
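
The core of this can be sketched in a few lines of C. The function name map_enlarged_alias() is hypothetical, and fd is assumed to be the descriptor the original MAP_SHARED mapping was created from; the grace-period tracking is left out.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <unistd.h>

/* Enlarge the backing file to `newsize` bytes and map all of it a second
 * time. The old mapping is left untouched and keeps working; the caller
 * munmap()s it once the grace period is over.
 * Returns MAP_FAILED on error. */
void *map_enlarged_alias(int fd, size_t newsize)
{
    assert(newsize > 0);
    if (ftruncate(fd, (off_t)newsize) == -1)
        return MAP_FAILED;
    return mmap(NULL, newsize, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_NORESERVE, fd, 0);
}
```

While both mappings exist, a store through either one is visible through the other, because both refer to the same pages of the file. No data is copied at any point.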

In practice, when a mapping needs to be enlarged, you first enlarge the backing file, then try mremap(mapping, oldsize, newsize, 0) to see if the mapping can be grown, without moving the mapping. Only if the in-place remapping fails, do you need to switch to the new mapping.
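
Put together, the enlarge step might look like the sketch below. The function grow_mapping() and its moved out-parameter are inventions for this example; the sketch assumes the mapping was created from fd with MAP_SHARED | MAP_NORESERVE at offset 0.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <unistd.h>

/* Grow a shared file mapping from oldsize to newsize bytes.
 * On success, returns the address to use from now on and sets *moved:
 *   *moved == 0: grown in place, nothing else to do;
 *   *moved == 1: a second mapping was created, and the caller must
 *                munmap(mapping, oldsize) once no thread uses it.
 * Returns MAP_FAILED on error. */
void *grow_mapping(int fd, void *mapping, size_t oldsize, size_t newsize,
                   int *moved)
{
    assert(newsize > oldsize);
    if (ftruncate(fd, (off_t)newsize) == -1)
        return MAP_FAILED;

    /* flags == 0: never move the mapping; fail if there is no room. */
    void *grown = mremap(mapping, oldsize, newsize, 0);
    if (grown != MAP_FAILED) {
        *moved = 0;
        return grown;             /* same address as `mapping` */
    }

    /* No room to grow in place: fall back to a second mapping. */
    *moved = 1;
    return mmap(NULL, newsize, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_NORESERVE, fd, 0);
}
```

Only the moved == 1 path needs the grace-period bookkeeping, so it can afford to be the slow path.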

Edited to add: You definitely do want to use mremap() instead of just using mmap() with MAP_FIXED to create a larger mapping, because mmap() with MAP_FIXED (atomically) unmaps any existing mappings in the target range, including those belonging to other files or shared memory regions. With mremap(), you get an error if the enlarged mapping would overlap with existing mappings; with mmap() and MAP_FIXED, any existing mappings that the new mapping overlaps are silently replaced (unmapped).
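
The difference is easy to observe with two small experiments on anonymous memory. The function names below are invented for this sketch, and the behaviour shown is what current Linux kernels are documented to do, not something guaranteed across other systems.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if mmap() with MAP_FIXED silently replaced a populated mapping,
 * 0 if it did not, -1 if the setup itself failed. */
int map_fixed_clobbers(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    unsigned char *victim = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (victim == MAP_FAILED)
        return -1;
    victim[0] = 42;

    /* Same address with MAP_FIXED: no error, the old page is simply gone. */
    unsigned char *fresh = mmap(victim, (size_t)page, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (fresh == MAP_FAILED) {
        munmap(victim, (size_t)page);
        return -1;
    }
    int clobbered = (fresh == victim && fresh[0] == 0);
    munmap(fresh, (size_t)page);
    return clobbered;
}

/* Returns 1 if mremap() with flags == 0 refuses (ENOMEM) to grow a mapping
 * into its occupied neighbour, 0 if it did not, -1 if the setup failed. */
int mremap_refuses_overlap(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    unsigned char *area = mmap(NULL, 2 * (size_t)page, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (area == MAP_FAILED)
        return -1;
    /* A different protection makes the second page a separate mapping. */
    if (mprotect(area + page, (size_t)page, PROT_NONE) == -1) {
        munmap(area, 2 * (size_t)page);
        return -1;
    }
    errno = 0;
    void *grown = mremap(area, (size_t)page, 2 * (size_t)page, 0);
    int refused = (grown == MAP_FAILED && errno == ENOMEM);
    munmap(area, 2 * (size_t)page);
    return refused;
}
```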

Unfortunately, I must admit I haven't verified if the kernel detects collisions between existing mappings, or if it just assumes the programmer knows about such collisions; after all, the programmer must know the address and length of every mapping, and therefore should know if the mapping would collide with another existing one. Edited to add: The 3.8 series kernels do detect this: mremap() returns MAP_FAILED with errno == ENOMEM if the enlarged mapping would collide with existing maps. I expect all Linux kernels to behave the same way, but have no proof, aside from testing on 3.8.0-30-generic on x86_64.

Also note that in Linux, POSIX shared memory is implemented using a special filesystem, typically a tmpfs mounted at /dev/shm (or /run/shm with /dev/shm being a symlink). shm_open() et al. are implemented by the C library. Instead of relying on a large POSIX shared memory capability, I'd personally use a specially mounted tmpfs for use in a custom application. If nothing else, the security controls (users and groups able to create new "files" in there) are much easier and clearer to manage.

If the mapping is, and has to be, anonymous, you can still use mremap(mapping, oldsize, newsize, 0) to try and resize it; it just may fail.

Even with hundreds of thousands of mappings, the 64-bit address space is vast, and the failure case rare. So, although you must handle the failure case too, it does not necessarily have to be fast. Edited to modify: On x86-64, the address space is 47-bit, and mappings must start at a page boundary (12 bits for normal pages, 21 bits for 2M hugepages, and 30 bits for 1G hugepages), so there are only 35, 26, or 17 bits available in the address space for the mappings. So, collisions are more frequent, even if random addresses are suggested. (For 2M mappings, 1024 maps had an occasional collision, but at 65536 maps, the probability of a collision (resize failure) was about 2.3%.)

Edited to add: User strcmp pointed out in a comment that by default Linux mmap() will return consecutive addresses, in which case growing the mapping will always fail unless it's the last one, or a map was unmapped just there.

The approach I know works in Linux is complicated and very architecture-specific. You can remap the original mapping read-only, create a new anonymous map, and copy the old contents there. You need a SIGSEGV handler (SIGSEGV signal being raised for the particular thread that tries to write to the now read-only mapping, this being one of the few recoverable SIGSEGV situations in Linux even if POSIX disagrees) that examines the instruction that caused the problem, simulates it (modifying the contents of the new mapping instead), and then skips the problematic instruction. After a grace period, when there are no more threads accessing the old, now read-only mapping, you can tear down the mapping.
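
The sketch below covers only the detection half of that scheme: it makes a mapping read-only and uses an SA_SIGINFO handler to learn which address a thread tried to write. Decoding and emulating the faulting instruction is deliberately left out; the handler simply jumps back with siglongjmp(), which is a shortcut for demonstration and not what a real implementation would do. All names (on_segv, probe_write_fault, and the two static variables) are made up for this example, and it is single-threaded.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>

static sigjmp_buf fault_return;        /* where the demo resumes */
static void *volatile fault_address;   /* address of the refused write */

/* SA_SIGINFO handler: record the faulting address and bail out.
 * A real handler would decode and emulate the instruction instead. */
static void on_segv(int sig, siginfo_t *info, void *context)
{
    (void)sig;
    (void)context;
    fault_address = info->si_addr;
    siglongjmp(fault_return, 1);
}

/* Make `mapping` read-only, attempt one write at `offset`, and return the
 * address the kernel reported for the fault (NULL if nothing faulted). */
void *probe_write_fault(void *mapping, size_t length, size_t offset)
{
    struct sigaction sa = {0}, old_sa;

    assert(offset < length);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old_sa) == -1)
        return NULL;

    fault_address = NULL;
    if (mprotect(mapping, length, PROT_READ) == 0 &&
        sigsetjmp(fault_return, 1) == 0) {
        /* This store is refused and raises SIGSEGV in this thread. */
        ((volatile unsigned char *)mapping)[offset] = 1;
    }

    sigaction(SIGSEGV, &old_sa, NULL);   /* restore the previous handler */
    return fault_address;
}
```

In the full scheme, the handler would redirect the store to the corresponding offset in the new mapping and then advance the saved instruction pointer in the context argument past the faulting instruction.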

All of the nastiness is in the SIGSEGV handler, of course. Not only must it be able to decode all machine instructions and simulate them (or at least those that write to memory), but it must also busy-wait if the new mapping has not been completely copied yet. It is complicated, absolutely unportable, and very architecture-specific, but possible.