mmap vs O_DIRECT for random reads (what are the buffers involved?)


Problem description


I am implementing a disk-based hashtable supporting a large number of keys (26+ million). The value is deserialized. Reads are essentially random throughout the file, values are smaller than the page size, and I am optimising for SSDs. Safety/consistency are not such huge issues (performance matters).

My current solution involves using a mmap() file with MADV_RANDOM | MADV_DONTNEED set to disable prefetching by the kernel and load data only on demand.

I get the idea that the kernel reads from disk into a memory buffer, and I deserialize from there.

What about O_DIRECT? If I call read(), I'm still copying into a buffer (which I deserialize from), so can I gain any advantage?

Where can I find more info on the buffers involved with a mmap() file and with calling read() on a file opened with O_DIRECT?

I am not interested in read-ahead or caching. They have nothing to offer for my use case.

Solution

O_DIRECT is an option for read/write operations in which the data bypasses the system buffers (the page cache) and is copied directly between your buffer and the disk controller. To get the benefit of O_DIRECT you have to meet some conditions: the buffer address must be aligned to the memory page size, and the buffer size must be aligned to the I/O block size.
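As a rough illustration of those alignment conditions, here is a minimal C sketch of a single block read with O_DIRECT. It assumes Linux and a 4096-byte alignment (`BLOCK`), which suits many SSDs but should be checked against the actual device. `read_block_direct` is a hypothetical helper, not something from the answer. Besides the buffer address and size, Linux also expects the file offset to be aligned, so the sketch rounds it down. Some filesystems reject O_DIRECT at open time with EINVAL; in that case the sketch falls back to an ordinary buffered open so that it still runs.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes O_DIRECT in <fcntl.h> on Linux/glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0             /* platform without O_DIRECT: plain buffered reads */
#endif

#define BLOCK 4096             /* assumed alignment, not queried from the device */

/* Read the BLOCK-sized block that contains `offset` into a newly allocated,
 * BLOCK-aligned buffer. Returns the buffer (caller frees it) or NULL on error;
 * *nread receives the number of bytes actually read. */
void *read_block_direct(const char *path, off_t offset, ssize_t *nread)
{
    assert(offset >= 0 && nread != NULL);

    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0 && errno == EINVAL)
        fd = open(path, O_RDONLY);      /* filesystem refused O_DIRECT */
    if (fd < 0)
        return NULL;

    void *buf = NULL;
    if (posix_memalign(&buf, BLOCK, BLOCK) != 0) {  /* aligned address */
        close(fd);
        return NULL;
    }

    off_t aligned = offset - (offset % BLOCK);      /* aligned file offset */
    *nread = pread(fd, buf, BLOCK, aligned);        /* aligned length */
    close(fd);
    if (*nread < 0) {
        free(buf);
        return NULL;
    }
    return buf;
}
```

The value the caller wants then sits at `(char *)buf + offset % BLOCK`. There is still exactly one buffer to deserialize from, but it is the caller's own rather than a page owned by the kernel.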

Anyway, if you use mmap, you do not use read/write. Moreover, after mmap you can close the file descriptor and the mapping will still work. So O_DIRECT is useless with mmap.
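A small C sketch of that point, assuming Linux: the file is mapped, the descriptor is closed straight away, and the data is still readable through the mapping. `map_file_random` and `copy_value` are hypothetical names. The sketch passes only MADV_RANDOM, because madvise() advice values are separate requests rather than bit flags, so each one needs its own call.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes madvise() and MADV_* on glibc */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only and hint that access will be random.
 * Returns the mapping (release it with munmap) or NULL on error. */
const unsigned char *map_file_random(const char *path, size_t *len)
{
    assert(len != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                 /* the mapping keeps its own reference to the file */
    if (p == MAP_FAILED)
        return NULL;

    /* Advisory only: a failure here does not invalidate the mapping. */
    (void)madvise(p, (size_t)st.st_size, MADV_RANDOM);

    *len = (size_t)st.st_size;
    return p;
}

/* Copy n bytes at offset off out of the mapping, e.g. before deserializing.
 * The first touch of a page faults it in from the page cache or the disk. */
int copy_value(const unsigned char *map, size_t len, size_t off,
               void *out, size_t n)
{
    if (off > len || n > len - off)
        return -1;             /* requested range lies outside the mapping */
    memcpy(out, map + off, n);
    return 0;
}
```

With mmap the only kernel-side buffer is the page cache page itself; `copy_value` is optional and only needed when the deserializer cannot work on the mapped bytes in place.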

What I can recommend to increase performance:

1. If your subsystem gets many requests for missing keys, you can create a Bloom filter in memory (http://en.wikipedia.org/wiki/Bloom_filter). You then check each search key against the Bloom filter and reject missing keys without an actual request to disk.

2. To conserve memory, use a two-level scheme in which you keep the bucket heads in mmap-ed memory but read the buckets themselves from the file with pread().
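The Bloom filter from item 1 could be sketched in C roughly as follows. This is not the answerer's implementation: the hash (a seeded FNV-1a variant), the double-hashing scheme for deriving several probe positions, and all names are assumptions chosen to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t *bits;     /* bit array, nbits long */
    size_t   nbits;
    unsigned nhashes;  /* probes per key */
} bloom_t;

/* 64-bit FNV-1a with a seed XORed into the offset basis (a variant). */
static uint64_t fnv1a(const void *key, size_t len, uint64_t seed)
{
    const uint8_t *p = key;
    uint64_t h = 14695981039346656037ULL ^ seed;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

int bloom_init(bloom_t *b, size_t nbits, unsigned nhashes)
{
    assert(nbits > 0 && nhashes > 0);
    b->bits = calloc((nbits + 7) / 8, 1);
    b->nbits = nbits;
    b->nhashes = nhashes;
    return b->bits ? 0 : -1;
}

void bloom_free(bloom_t *b)
{
    free(b->bits);
    b->bits = NULL;
}

/* Probe i uses position (h1 + i * h2) mod nbits (double hashing). */
void bloom_add(bloom_t *b, const void *key, size_t len)
{
    uint64_t h1 = fnv1a(key, len, 0);
    uint64_t h2 = fnv1a(key, len, 0x9e3779b97f4a7c15ULL) | 1;
    for (unsigned i = 0; i < b->nhashes; i++) {
        uint64_t bit = (h1 + i * h2) % b->nbits;
        b->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

/* Returns 0 if the key is definitely absent, 1 if it may be present. */
int bloom_maybe_contains(const bloom_t *b, const void *key, size_t len)
{
    uint64_t h1 = fnv1a(key, len, 0);
    uint64_t h2 = fnv1a(key, len, 0x9e3779b97f4a7c15ULL) | 1;
    for (unsigned i = 0; i < b->nhashes; i++) {
        uint64_t bit = (h1 + i * h2) % b->nbits;
        if (!(b->bits[bit / 8] & (1u << (bit % 8))))
            return 0;
    }
    return 1;
}
```

A lookup would call `bloom_maybe_contains` first and go to disk only when it returns 1. As a rule of thumb, about 10 bits per key with 7 probes gives a false-positive rate of roughly 1%, which for 26 million keys comes to around 32 MB of memory.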

I implemented both options in my autocomplete subsystem. You can see it online here: http://olegh.ftp.sh/autocomplete.html and estimate the performance on a slow old computer, a Celeron-300.
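The two-level scheme from item 2 might look something like the C sketch below. The on-disk layout is purely hypothetical (a separate index file holding fixed-size `bucket_head_t` entries and a data file holding the buckets), as are all the names; the answer does not describe its actual format.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes pread() under strict compiler modes */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical index entry: where one bucket lives in the data file. */
typedef struct {
    uint64_t offset;    /* byte offset of the bucket */
    uint32_t length;    /* bucket size in bytes */
    uint32_t reserved;  /* explicit padding, keeps the entry at 16 bytes */
} bucket_head_t;

typedef struct {
    const bucket_head_t *heads;  /* level 1: mmap-ed index, stays in memory */
    size_t nbuckets;
    int data_fd;                 /* level 2: buckets fetched with pread() */
} table_t;

int table_open(table_t *t, const char *index_path, const char *data_path)
{
    int ifd = open(index_path, O_RDONLY);
    if (ifd < 0)
        return -1;

    struct stat st;
    if (fstat(ifd, &st) != 0 || st.st_size < (off_t)sizeof(bucket_head_t)) {
        close(ifd);
        return -1;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, ifd, 0);
    close(ifd);                  /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return -1;

    t->heads = p;
    t->nbuckets = (size_t)st.st_size / sizeof(bucket_head_t);
    t->data_fd = open(data_path, O_RDONLY);
    if (t->data_fd < 0) {
        munmap(p, (size_t)st.st_size);
        return -1;
    }
    return 0;
}

/* Look up the bucket head in memory, then read only that bucket from disk.
 * Returns the number of bytes read, or -1 on error or if `out` is too small. */
ssize_t table_read_bucket(const table_t *t, uint64_t key_hash,
                          void *out, size_t cap)
{
    assert(t->nbuckets > 0);
    const bucket_head_t *h = &t->heads[key_hash % t->nbuckets];
    if (h->length > cap)
        return -1;
    return pread(t->data_fd, out, h->length, (off_t)h->offset);
}
```

Only the small index is kept mapped, so resident memory scales with the number of buckets rather than with the size of the values, and each lookup costs one pread() of exactly the bucket it needs.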
