Linux: Large int array: mmap vs seek file?


Problem Description


    Suppose I have a dataset that is an array of 1e12 32-bit ints (4 TB) stored in a file on a 4TB HDD ext4 filesystem.

    Consider that the data is most likely random (or at least seems random).

    // pseudo-code
    for (long long i = 0; i < (1LL << 40); i++)
       SetFileIntAt(i) = GetRandInt();
    

    Further, consider that I wish to read individual int elements in an unpredictable order and that the algorithm runs indefinitely (it is on-going).

    // pseudo-code
    while (true)
       UseInt(GetFileInt(GetRand(1<<40)));
    

    We are on Linux x86_64, gcc. You can assume the system has 4GB of RAM (i.e. 1000x smaller than the dataset).

    The following are two ways to architect access:

    (A) mmap the file to a 4TB block of memory, and access it as an int array

    (B) open(2) the file and use lseek(2) and read(2) to read the ints.

    Out of A and B, which will have the better performance, and why?

    Is there another design that will give better performance than either A or B?

    Solution

    I'd say performance should be similar if access is truly random. The OS will use a similar caching strategy whether the data page is mapped from a file or the file data is simply cached in the page cache without being mapped into the process's address space.

    Assuming cache is ineffective:

    • You can use fadvise to declare your access pattern in advance and disable readahead.
    • Due to address space layout randomization, there might not be a contiguous block of 4 TB in your virtual address space.
    • If your data set ever expands, the address space issue might become more pressing.

    So I'd go with explicit reads.

