Linux: Large int array: mmap vs seek file?
Question
Suppose I have a dataset that is an array of 1e12 32-bit ints (4 TB) stored in a file on a 4 TB HDD with an ext4 filesystem.
Consider that the data is most likely random (or at least seems random).
// pseudo-code
for (long long i = 0; i < (1LL << 40); i++)
    SetFileIntAt(i) = GetRandInt();
Further, consider that I wish to read individual int elements in an unpredictable order and that the algorithm runs indefinitely (it is ongoing).
// pseudo-code
while (true)
    UseInt(GetFileInt(GetRand(1LL << 40)));
We are on Linux x86_64 with gcc. You can assume the system has 4 GB of RAM (i.e. 1000x smaller than the dataset).
The following are two ways to architect access:
(A) mmap the file to a 4TB block of memory, and access it as an int array
(B) open(2) the file and use lseek(2) and read(2) to read the ints.
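For concreteness, the two options could be sketched in C roughly as follows. The helper names map_ints and read_int_seek are hypothetical (they do not appear in the question), error handling is minimal, and a 64-bit off_t is assumed, as on x86_64.

```c
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* (A) Map the whole file read-only and treat it as an int array.
 * Hypothetical helper: len is the file size in bytes.
 * Returns NULL if the mapping cannot be created. */
static const int32_t *map_ints(int fd, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? NULL : (const int32_t *)p;
}

/* (B) Seek to element idx and read it with two system calls.
 * Hypothetical helper: returns 0 on success, -1 on error or short read. */
static int read_int_seek(int fd, uint64_t idx, int32_t *out)
{
    off_t off = (off_t)(idx * sizeof *out);
    if (lseek(fd, off, SEEK_SET) == (off_t)-1)
        return -1;
    return read(fd, out, sizeof *out) == (ssize_t)sizeof *out ? 0 : -1;
}
```

With (A), an element is fetched as m[idx] after a single map_ints call, and a miss shows up as a page fault. With (B), every lookup pays for an lseek and a read, and a miss shows up as a blocking read.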
Of A and B, which will have better performance, and why?
Is there another design that will give better performance than either A or B?
Solution
I'd say performance should be similar if access is truly random. The OS will use a similar caching strategy whether the data page is mapped from a file or the file data is simply cached without an association to RAM.
Assuming cache is ineffective:
- You can use fadvise (posix_fadvise(2)) to declare your access pattern in advance and disable readahead.
- Due to address space layout randomization, there might not be a contiguous block of 4 TB in your virtual address space.
- If your data set ever expands, the address space issue might become more pressing.
So I'd go with explicit reads.
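A minimal sketch of the explicit-read approach, assuming the file holds native-endian 32-bit ints. The helper names open_random and get_file_int are hypothetical. The sketch uses pread(2) rather than a separate seek and read, which is a substitution on my part: it needs one system call per lookup and does not touch the shared file offset.

```c
/* Exposes posix_fadvise and pread when compiling with a strict -std= flag. */
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Open the data file read-only and hint that access will be random,
 * which asks the kernel to disable readahead for this descriptor.
 * Hypothetical helper: returns the fd, or -1 if open fails. */
static int open_random(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    /* offset 0 with len 0 covers the whole file; the hint is advisory,
     * so a failure here is deliberately ignored. */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
    return fd;
}

/* Read element idx with a single positioned read.
 * Hypothetical helper: returns 0 on success, -1 on error or short read. */
static int get_file_int(int fd, uint64_t idx, int32_t *out)
{
    off_t off = (off_t)(idx * sizeof *out);
    ssize_t n = pread(fd, out, sizeof *out, off);
    return n == (ssize_t)sizeof *out ? 0 : -1;
}
```

A caller would open the file once with open_random and then call get_file_int(fd, idx, &value) inside the lookup loop. Because pread does not move the file offset, several threads could share one descriptor.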