mmap() vs. reading blocks


Question

I'm working on a program that will be processing files that could potentially be 100GB or more in size. The files contain sets of variable length records. I've got a first implementation up and running and am now looking towards improving performance, particularly at doing I/O more efficiently since the input file gets scanned many times.

Is there a rule of thumb for using mmap() versus reading in blocks via C++'s fstream library? What I'd like to do is read large blocks from disk into a buffer, process complete records from the buffer, and then read more.
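For reference, the "read a block, process complete records, read more" loop described above can be sketched roughly as follows. The question does not say how records are delimited, so this assumes (purely for illustration) that each record ends with a newline; `for_each_record` and `split_records` are hypothetical names, not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <istream>
#include <sstream>
#include <string>
#include <string_view>
#include <vector>

// ASSUMPTION: records are terminated by '\n'. The real format is not given.
// Reads `in` in blocks of `block_size` bytes and calls `on_record` once per
// complete record. Returns the number of records delivered.
std::size_t for_each_record(std::istream& in, std::size_t block_size,
                            const std::function<void(std::string_view)>& on_record) {
    assert(block_size > 0);
    std::vector<char> block(block_size);
    std::string pending;  // start of a record whose end is in a later block
    std::size_t count = 0;

    while (in) {
        in.read(block.data(), static_cast<std::streamsize>(block.size()));
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0) break;

        const std::string_view data(block.data(), got);
        std::size_t start = 0;
        for (;;) {
            const std::size_t nl = data.find('\n', start);
            if (nl == std::string_view::npos) break;
            if (pending.empty()) {
                // Record lies entirely inside this block: no copy needed.
                on_record(data.substr(start, nl - start));
            } else {
                // Record began in an earlier block: stitch the pieces together.
                pending.append(data.substr(start, nl - start));
                on_record(pending);
                pending.clear();
            }
            ++count;
            start = nl + 1;
        }
        // Whatever is left is a partial record; the next block completes it.
        pending.append(data.substr(start));
    }
    if (!pending.empty()) {  // final record without a trailing '\n'
        on_record(pending);
        ++count;
    }
    return count;
}

// Usage sketch: split an in-memory string the same way a file would be split.
std::vector<std::string> split_records(const std::string& text, std::size_t block_size) {
    std::istringstream in(text);
    std::vector<std::string> out;
    for_each_record(in, block_size, [&](std::string_view r) { out.emplace_back(r); });
    return out;
}
```

With a real file you would pass a `std::ifstream` opened in binary mode and a much larger block size (megabytes rather than bytes). Only records that straddle a block boundary are copied.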

The mmap() code could potentially get very messy since mmap'd blocks need to lie on page-sized boundaries (my understanding) and records could potentially lie across page boundaries. With fstreams, I can just seek to the start of a record and begin reading again, since we're not limited to reading blocks that lie on page-sized boundaries.
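On the page-boundary worry: only the file offset passed to `mmap` has to be a multiple of the page size. The mapped region is contiguous in memory, so a record that crosses a page boundary inside it can be read like any other bytes. A minimal POSIX sketch of mapping a window at an arbitrary byte offset, where `MappedWindow`, `map_window` and `unmap_window` are hypothetical helper names made up for this example:

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// A read-only view of bytes [offset, offset + length) of a file.
struct MappedWindow {
    void* base = MAP_FAILED;     // page-aligned address returned by mmap
    std::size_t mapped = 0;      // number of bytes actually mapped
    const char* data = nullptr;  // points at the requested offset
    std::size_t size = 0;        // requested length
};

// Rounds `offset` down to a page boundary, maps from there, and exposes a
// pointer to the byte the caller asked for. The caller must make sure that
// offset + length does not run past the end of the file.
MappedWindow map_window(int fd, off_t offset, std::size_t length) {
    assert(offset >= 0 && length > 0);
    const long page = sysconf(_SC_PAGESIZE);
    const off_t aligned = offset - (offset % static_cast<off_t>(page));
    const std::size_t slack = static_cast<std::size_t>(offset - aligned);

    MappedWindow w;
    w.mapped = length + slack;
    w.base = mmap(nullptr, w.mapped, PROT_READ, MAP_PRIVATE, fd, aligned);
    if (w.base == MAP_FAILED) {
        w.mapped = 0;
        return w;  // data stays nullptr to signal failure
    }
    w.data = static_cast<const char*>(w.base) + slack;
    w.size = length;
    return w;
}

void unmap_window(MappedWindow& w) {
    if (w.base != MAP_FAILED) munmap(w.base, w.mapped);
    w = MappedWindow{};
}
```

The cost of the alignment is at most one page of extra mapping per window, which is negligible next to the window sizes a 100GB file would call for.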

How can I decide between these two options without actually writing up a complete implementation first? Any rules of thumb (e.g., mmap() is 2x faster) or simple tests?

Solution

I was trying to find the final word on mmap / read performance on Linux and I came across a nice post (link) on the Linux kernel mailing list. It's from 2000, so there have been many improvements to IO and virtual memory in the kernel since then, but it nicely explains the reason why mmap or read might be faster or slower.

  • A call to mmap has more overhead than read (just like epoll has more overhead than poll, which has more overhead than read). Changing virtual memory mappings is a quite expensive operation on some processors for the same reasons that switching between different processes is expensive.
  • The IO system can already use the disk cache, so if you read a file, you'll hit the cache or miss it no matter what method you use.

However,

  • Memory maps are generally faster for random access, especially if your access patterns are sparse and unpredictable.
  • Memory maps allow you to keep using pages from the cache until you are done. This means that if you use a file heavily for a long period of time, then close it and reopen it, the pages will still be cached. With read, your file may have been flushed from the cache ages ago. This does not apply if you use a file and immediately discard it. (If you try to mlock pages just to keep them in cache, you are trying to outsmart the disk cache and this kind of foolery rarely helps system performance).
  • Reading a file directly is very simple and fast.

The discussion of mmap/read reminds me of two other performance discussions:

  • Some Java programmers were shocked to discover that nonblocking I/O is often slower than blocking I/O, which makes perfect sense if you know that nonblocking I/O requires making more syscalls.

  • Some other network programmers were shocked to learn that epoll is often slower than poll, which makes perfect sense if you know that managing epoll requires making more syscalls.

Conclusion: Use memory maps if you access data randomly, keep it around for a long time, or if you know you can share it with other processes (MAP_SHARED isn't very interesting if there is no actual sharing). Read files normally if you access data sequentially or discard it after reading. And if either method makes your program less complex, do that. For many real world cases there's no sure way to show one is faster without testing your actual application and NOT a benchmark.
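If you do want numbers for your own workload, one low-effort approach is to run the same processing over the same file through both paths and time each. The sketch below is only a starting point under stated assumptions: `consume` is a stand-in byte sum that should be replaced by your real record processing, all function names are hypothetical, and it maps the whole file at once, which assumes a 64-bit address space.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <stdexcept>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <vector>

// Stand-in for real work. Replace with your actual record processing.
static std::uint64_t consume(const unsigned char* p, std::size_t n, std::uint64_t acc) {
    for (std::size_t i = 0; i < n; ++i) acc += p[i];
    return acc;
}

// Sequential scan using read() into a fixed-size buffer.
std::uint64_t scan_with_read(const char* path, std::size_t block_size) {
    assert(block_size > 0);
    const int fd = open(path, O_RDONLY);
    if (fd < 0) throw std::runtime_error("open failed");
    std::vector<unsigned char> buf(block_size);
    std::uint64_t acc = 0;
    for (;;) {
        const ssize_t got = read(fd, buf.data(), buf.size());
        if (got <= 0) break;  // 0 is end of file; a negative value is treated as end here
        acc = consume(buf.data(), static_cast<std::size_t>(got), acc);
    }
    close(fd);
    return acc;
}

// Sequential scan over a single mapping of the whole file.
std::uint64_t scan_with_mmap(const char* path) {
    const int fd = open(path, O_RDONLY);
    if (fd < 0) throw std::runtime_error("open failed");
    struct stat st {};
    if (fstat(fd, &st) != 0) { close(fd); throw std::runtime_error("fstat failed"); }
    const std::size_t size = static_cast<std::size_t>(st.st_size);
    if (size == 0) { close(fd); return 0; }  // mmap rejects zero-length mappings
    void* base = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (base == MAP_FAILED) throw std::runtime_error("mmap failed");
    const std::uint64_t acc = consume(static_cast<const unsigned char*>(base), size, 0);
    munmap(base, size);
    return acc;
}

// Wall-clock seconds taken by f().
template <typename F>
double seconds(F&& f) {
    const auto t0 = std::chrono::steady_clock::now();
    f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Since the file in question is scanned many times, the second and later passes mostly measure the page cache rather than the disk, so time several passes and compare like with like. Both functions should return the same value for the same file, which doubles as a sanity check.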

(Sorry for necro'ing this question, but I was looking for an answer and this question kept coming up at the top of Google results.)
