Reading memory mapped bzip2 compressed file

Problem description

So I'm playing with the Wikipedia dump file. It's an XML file that has been bzipped. I can write all the files to directories, but then when I want to do analysis, I have to reread all the files on the disk. This gives me random access, but it's slow. I have enough RAM to hold the entire bzipped file in memory.

I can load the dump file just fine and read all the lines, but I cannot seek in it as it's gigantic. From what it seems, the bz2 library has to read and capture the offset before it can bring me there (and decompress it all, as the offset is in decompressed bytes).

Anyway, I'm trying to mmap the dump file (~9.5 GB) and load it into bz2. Obviously, I want to test this on a small bzip2 file first.

I want to map the mmap file to a BZ2File so I can seek through it (to get to a specific, uncompressed byte offset), but from what it seems, this is impossible without decompressing the entire mmap file (this would be well over 30 gigabytes).

Do I have any options?

Here's some code I wrote to test.

import bz2
import mmap

# bz2.compress() needs bytes, so the sample text is a bytes literal
lines = b'''This is my first line
This is the second
And the third
'''

with open("bz2TestFile", "wb") as f:
    f.write(bz2.compress(lines))

with open("bz2TestFile", "rb") as f:
    # access=ACCESS_READ is the portable spelling of a read-only mapping
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    print("Part of MMAPPED")
    # This does not work until I hit a minimum length
    # due to (I believe) the checksums in the bz2 algorithm
    #
    for x in range(len(mapped) + 1):
        line = mapped[0:x]
        try:
            print(x)
            print(bz2.decompress(line))
        except (OSError, ValueError):
            # A truncated stream cannot be decompressed yet
            pass

# I can decompress the entire mmapped file
print(":entire mmap file:")
print(bz2.decompress(mapped))

# I can create a bz2File object from the file path
# Is there a way to map the mmap object to this function?
print(":BZ2 File readline:")
bzF = bz2.BZ2File("bz2TestFile")

# Seek to specific offset
bzF.seek(22)
# Read the data
print(bzF.readline())

This all makes me wonder though, what is special about the bz2 file object that allows it to read a line after seeking? Does it have to read every line before it to get the checksums from the algorithm to work out correctly?
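
For what it's worth, the Python documentation notes that seeking in a BZ2File is emulated, so a forward seek amounts to decompressing from the start and throwing away everything before the target offset. A minimal sketch of that idea follows. The helper name `read_line_at` and its `chunk_size` parameter are made up for illustration, it assumes a single-stream file, and it is not how the bz2 module is actually written.

```python
import bz2


def read_line_at(compressed, offset, chunk_size=64 * 1024):
    """Return the line that starts at `offset` in the uncompressed data.

    Hypothetical helper: it decompresses from the beginning and discards
    every byte before `offset`, which is roughly the cost of an emulated seek.
    Assumes `compressed` holds a single bzip2 stream.
    """
    decomp = bz2.BZ2Decompressor()
    skipped = 0   # uncompressed bytes discarded so far
    line = b""    # bytes collected at or after `offset`
    for start in range(0, len(compressed), chunk_size):
        out = decomp.decompress(compressed[start:start + chunk_size])
        if skipped < offset:
            # Still before the target: drop as much of this output as needed
            drop = min(offset - skipped, len(out))
            skipped += drop
            out = out[drop:]
        line += out
        newline = line.find(b"\n")
        if newline != -1:
            return line[:newline + 1]
        if decomp.eof:
            break
    return line


sample = b"This is my first line\nThis is the second\nAnd the third\n"
print(read_line_at(bz2.compress(sample), 22))
```

Because `compressed` only needs `len()` and slicing, an mmap object can be passed in directly, but the work done still grows with the offset being sought.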

Recommended answer

I found an answer! James Taylor wrote a couple of scripts for seeking in BZ2 files, and his scripts are in the bx-python package.

https://bitbucket.org/james_taylor/bx-python/overview

These work pretty well. They do not allow seeking to arbitrary byte offsets in the BZ2 file, but they read out blocks of BZ2 data and allow seeking based on blocks.
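
To illustrate why block-based seeking is possible at all: each bzip2 block begins with the 48-bit marker 0x314159265359, and blocks are not byte-aligned, so the marker has to be searched for bit by bit. The sketch below is not taken from bx-python. The function name `find_block_bit_offsets` is made up, the scan is only practical for small inputs, and the same bit pattern could in principle appear by chance inside compressed data.

```python
import bz2

BLOCK_MAGIC = 0x314159265359  # 48-bit marker at the start of every bzip2 block


def find_block_bit_offsets(data):
    """Return the bit offsets at which the block marker appears in `data`.

    Hypothetical helper for illustration only: it treats the whole input as
    one big integer, so it is far too slow for multi-gigabyte files.
    """
    total_bits = len(data) * 8
    value = int.from_bytes(data, "big")
    mask = (1 << 48) - 1
    offsets = []
    for bit in range(total_bits - 48 + 1):
        # Shift so the 48-bit window starting at `bit` sits in the low bits
        shift = total_bits - 48 - bit
        if (value >> shift) & mask == BLOCK_MAGIC:
            offsets.append(bit)
    return offsets


compressed = bz2.compress(b"This is my first line\nThis is the second\n")
print(find_block_bit_offsets(compressed))
```

The first block follows the 4-byte stream header, so its marker is expected at bit offset 32. An index of such offsets, paired with the uncompressed position of each block, is the kind of table a block-based seeker would need.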

In particular, see bx-python/wiki/IO/SeekingInBzip2Files
