Linux: Large int array: mmap vs seek file?


Problem Description


    Suppose I have a dataset that is an array of 1e12 32-bit ints (4 TB) stored in a file on a 4TB HDD ext4 filesystem.

    Consider that the data is most likely random (or at least seems random).

    // pseudo-code
    for (long long i = 0; i < (1LL << 40); i++)
       SetFileIntAt(i) = GetRandInt();
    

    Further, consider that I wish to read individual int elements in an unpredictable order and that the algorithm runs indefinitely (it is on-going).

    // pseudo-code
    while (true)
       UseInt(GetFileInt(GetRand(1<<40)));
    

    We are on Linux x86_64, gcc. You can assume the system has 4GB of RAM (i.e. 1000x smaller than the dataset).

    The following are two ways to architect access:

    (A) mmap the file to a 4TB block of memory, and access it as an int array

    (B) open(2) the file and use lseek(2) and read(2) to read the ints.

    Out of A and B, which will have the better performance, and why?

    Is there another design that will give better performance than either A or B?

    Solution

    I'd say performance should be similar if access is truly random. The OS will use a similar caching strategy whether the data page is mapped from a file or the file data is simply cached in the page cache without being mapped into the process's address space.

    Assuming cache is ineffective:

    • You can use fadvise to declare your access pattern in advance and disable readahead.
    • Due to address space layout randomization, there might not be a contiguous block of 4 TB in your virtual address space.
    • If your data set ever expands, the address space issue might become more pressing.

    So I'd go with explicit reads.

