Efficiently reading a few lines from a very large binary file

Problem description

Here is a simple example to illustrate my problem: I have a very large binary file containing 10 million values.

I want to get 5,000 values from specific points in that file.

I have a list of indexes giving me the exact position in the file where each of my values sits.

To solve this I tried two methods:

  1. Go over all the values and simply use seek() (from the beginning of the file) to get each value, like this:

    binaryFile_new = open(binary_folder_path, "r+b")
    wanted_line_list = []
    for index in index_list:
        binaryFile_new.seek(size * index, 0)
        wanted_line = binaryFile_new.read(size)
        wanted_line_list.append(wanted_line)
    binaryFile_new.close()
    

    But as I understand it, this solution reads through from the beginning for each index, so the complexity is O(N**2) in terms of file size.

  2. Sort the indexes so I can go through the file "once" while seeking from the current position, with something like this:

    binaryFile_new = open(binary_folder_path, "r+b")
    sorted_index_list = sorted(index_list)
    wanted_line_list = []
    for i, index in enumerate(sorted_index_list):
        if i == 0:
            # First value: absolute seek from the start of the file.
            binaryFile_new.seek(size * index, 0)
        else:
            # Later values: relative seek from the current position, which
            # is already just past the previous value (hence the "- size").
            binaryFile_new.seek((index - sorted_index_list[i - 1]) * size - size, 1)
        wanted_line = binaryFile_new.read(size)
        wanted_line_list.append(wanted_line)
    binaryFile_new.close()
    

    I expected the second solution to be much faster because, in theory, it would go through the whole file once, O(N).

    But for some reason both solutions run at the same speed.

I also have a hard constraint on memory usage, as I run this operation in parallel on many files, so I can't read the files into memory.

Maybe the mmap package would help? Though I think mmap also scans the entire file until it gets to the index, so it's not "true" random access.

Solution

I'd go with #1:

for index in index_list:
    binary_file.seek(size * index)
    # ...

(I cleaned up your code a bit to comply with Python naming conventions, and to avoid the magic constant 0, since SEEK_SET is the default anyway.)
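
For completeness, a minimal runnable version of that loop might look like this (a sketch, reusing the binary_folder_path, index_list and size names from your question):

wanted_line_list = []
with open(binary_folder_path, "rb") as binary_file:
    for index in index_list:
        # SEEK_SET is the default, so no whence argument is needed.
        binary_file.seek(size * index)
        wanted_line_list.append(binary_file.read(size))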

As I understand it, this solution reads through from the beginning for each index, therefore the complexity is O(N**2) in terms of file size.

No, a seek() does not "read through from the beginning"; that would defeat the point of seeking. Seeking to the beginning of the file and seeking to the end of the file cost roughly the same.
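
If you want to convince yourself of that, a quick, informal check might look like this (a sketch; path is assumed to point at one of your large files - seek() only updates the file offset, so both loops should take about the same time):

import os
import time

def time_seeks(path, repeats=100_000):
    # Time many seeks near the start and near the end of the file.
    end = os.path.getsize(path)
    with open(path, "rb") as f:
        for label, pos in (("near start", 0), ("near end", max(end - 1, 0))):
            t0 = time.perf_counter()
            for _ in range(repeats):
                f.seek(pos)
            elapsed = time.perf_counter() - t0
            print(f"seek {label}: {elapsed:.3f}s for {repeats} seeks")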

Sorting the indexes so I could go through the file "once" while seeking from the current position

I can't quickly find a reference for this, but I believe there's absolutely no point in calculating a relative offset in order to use SEEK_CUR instead of SEEK_SET.

There might be a small improvement just from seeking to the positions you need in order instead of randomly: if many of the points you need to read happen to be close to each other, there's an increased chance your reads will be served from cache (and your read pattern will trigger read-ahead in the file system).
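
In other words, the only variation worth trying is to keep the simple absolute seeks of #1 but visit the offsets in ascending order - a sketch, with the same assumed names as above:

wanted_line_list = []
with open(binary_folder_path, "rb") as binary_file:
    # Same absolute seeks as in #1, just in sorted order, so nearby
    # records can benefit from the page cache and read-ahead.
    for index in sorted(index_list):
        binary_file.seek(size * index)
        wanted_line_list.append(binary_file.read(size))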

Maybe the mmap package will help? Though I think mmap also scans the entire file until it gets to the index, so it's not "true" random access.

mmap doesn't scan the file. It sets up a region in your program's virtual memory corresponding to the file, so that the first access to any page in that region triggers a page fault, during which the OS reads that page (several KB) from the file (assuming it's not already in the page cache) before letting your program proceed.
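
So random access is exactly what mmap gives you. A minimal sketch, again reusing the names from your question:

import mmap

with open(binary_folder_path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Pages are faulted in lazily on first access; nothing is
        # scanned up front. Slicing an mmap object returns bytes.
        wanted_line_list = [mm[index * size:(index + 1) * size]
                            for index in index_list]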

The internet is full of discussions of the relative merits of read vs. mmap, but I recommend you don't bother trying to optimize by using mmap, and use that time instead to learn about virtual memory and the page cache.

[edit] Reading in chunks larger than the size of your values might save you a bit of CPU time in case many of the values you need to read are in the same chunk (which is not a given) - but unless your program is CPU-bound in production, I wouldn't bother with that either.
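
If you did want to experiment with that anyway, here is one way it could look (a sketch; CHUNK_RECORDS is a made-up tuning knob, not something from the discussion above):

CHUNK_RECORDS = 4096  # max records a single read() may span - an assumption

def read_values_chunked(binary_file, index_list, size):
    # Serve sorted indexes that fall in the same window from one larger
    # read() instead of one read() per value.
    results = {}
    pending = sorted(set(index_list))
    i = 0
    while i < len(pending):
        start = pending[i]
        j = i
        while j < len(pending) and pending[j] < start + CHUNK_RECORDS:
            j += 1
        binary_file.seek(size * start)
        chunk = binary_file.read((pending[j - 1] - start + 1) * size)
        for idx in pending[i:j]:
            offset = (idx - start) * size
            results[idx] = chunk[offset:offset + size]
        i = j
    return results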
