Reading memory mapped bzip2 compressed file

Views: 138 Published: 2020/5/9 23:53:40 Tags: python, mmap, bzip2


Problem description

I'm playing with the Wikipedia dump file, which is an XML file that has been bzipped. I can write all the files out to directories, but then when I want to do analysis I have to reread all the files from disk. This gives me random access, but it's slow. I have enough RAM to hold the entire bzipped file in memory.

I can load the dump file just fine and read all the lines, but I cannot seek in it because it's gigantic. From what I can tell, the bz2 library has to read up to an offset before it can take me there (decompressing everything along the way, since the offset is in decompressed bytes).
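Since offsets are in decompressed bytes, one pass over the file is enough to record where each line starts. The following is a minimal sketch of that idea, assuming Python 3; the helper name `index_line_offsets` and the in-memory sample data are hypothetical and only there for illustration.

```python
import bz2
import io


def index_line_offsets(fileobj):
    """Return the uncompressed byte offset at which each line starts.

    `fileobj` is any binary file-like object holding bzip2 data.
    The whole stream is decompressed once, sequentially.
    """
    offsets = []
    pos = 0
    with bz2.BZ2File(fileobj) as bzf:
        for line in bzf:
            offsets.append(pos)
            pos += len(line)
    return offsets


# Hypothetical sample data standing in for the real dump.
sample = b"This is my first line\nThis is the second\nAnd the third\n"
compressed = io.BytesIO(bz2.compress(sample))

offsets = index_line_offsets(compressed)
print(offsets)
```

Such an index only says where to go in the uncompressed data. Getting there still means decompressing from the start of the stream, so it does not by itself make seeking cheap.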

Anyway, I'm trying to mmap the dump file (~9.5 GB) and load it into bz2. I obviously want to test this on a small bzip2 file first.

I want to map the mmapped file to a BZ2File so I can seek through it (to get to a specific uncompressed byte offset), but from what I can tell this is impossible without decompressing the entire mmapped file (which would be well over 30 GB).

Do I have any options?
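One possible option, sketched here under the assumption of Python 3.3 or later: `bz2.BZ2File` accepts an existing file object in place of a path, and an `mmap` object provides `read()`, so it can be wrapped directly. The temporary file and sample text are hypothetical test data. The sketch skips forward by reading and discarding bytes rather than calling `seek()`, because `seek()` relies on the underlying object reporting `seekable()`, which is not assumed here.

```python
import bz2
import mmap
import os
import tempfile

sample = b"This is my first line\nThis is the second\nAnd the third\n"

# Write a small bzip2 file to a temporary path (hypothetical test data).
fd, path = tempfile.mkstemp(suffix=".bz2")
with os.fdopen(fd, "wb") as f:
    f.write(bz2.compress(sample))

with open(path, "rb") as f:
    # ACCESS_READ is the portable way to request a read-only mapping.
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        bzf = bz2.BZ2File(mapped)  # wrap the mapping, not a file path
        bzf.read(22)               # discard the first 22 uncompressed bytes
        second_line = bzf.readline()
        bzf.close()                # does not close the underlying mmap
    finally:
        mapped.close()

os.remove(path)
print(second_line)
```

This keeps the compressed bytes in the mapping and decompresses only as far as the requested position, but everything before that position is still decompressed and thrown away.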

Here's some code I wrote to test.

import bz2
import mmap

# bz2.compress() needs bytes, so the sample text is a bytes literal
lines = b'''This is my first line
This is the second
And the third
'''

with open("bz2TestFile", "wb") as f:
    f.write(bz2.compress(lines))

with open("bz2TestFile", "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)

    print("Part of MMAPPED")
    # This does not work until I hit a minimum length
    # due to (I believe) the checksums in the bz2 algorithm
    for x in range(len(mapped) + 2):
        line = mapped[0:x]
        try:
            print(x)
            print(bz2.decompress(line))
        except (OSError, ValueError):
            # Invalid or truncated stream: keep trying longer prefixes
            pass

# I can decompress the entire mmapped file
print(":entire mmap file:")
print(bz2.decompress(mapped))

# I can create a BZ2File object from the file path
# Is there a way to map the mmap object to this function?
print(":BZ2 File readline:")
bzF = bz2.BZ2File("bz2TestFile")

# Seek to a specific uncompressed offset (start of the second line)
bzF.seek(22)
# Read the data
print(bzF.readline())

This all makes me wonder, though: what is special about the bz2 file object that allows it to read a line after seeking? Does it have to read every line before that point to get the checksums from the algorithm to work out correctly?
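To illustrate what reading after a seek involves, here is a minimal sketch, assuming Python 3 and a single bzip2 stream, that feeds compressed bytes to `bz2.BZ2Decompressor` piece by piece and discards output until the target offset is reached. The Python documentation describes `BZ2File` seeking as emulated, which matches this decompress-and-discard pattern; the helper `read_at` and its parameters are hypothetical names, not part of the bz2 module.

```python
import bz2


def read_at(compressed, offset, size, chunk_size=64):
    """Return `size` uncompressed bytes starting at uncompressed `offset`.

    `compressed` can be bytes or an mmap object (slicing either yields bytes).
    Decompression always starts at the beginning of the stream; everything
    before `offset` is decompressed and thrown away.
    """
    decomp = bz2.BZ2Decompressor()
    out = bytearray()
    skipped = 0
    for start in range(0, len(compressed), chunk_size):
        piece = decomp.decompress(compressed[start:start + chunk_size])
        if skipped < offset:
            # Drop bytes that fall before the requested offset.
            drop = min(offset - skipped, len(piece))
            skipped += drop
            piece = piece[drop:]
        out += piece
        if len(out) >= size or decomp.eof:
            break
    return bytes(out[:size])


sample = b"This is my first line\nThis is the second\nAnd the third\n"
compressed = bz2.compress(sample)
print(read_at(compressed, 22, 19))
```

The decompressor only emits output once it has seen enough compressed input to decode a whole block, which is also why decompressing short prefixes of the file fails until a minimum length is reached.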
