How to obtain random access of a gzip compressed file


Question

According to the FAQ on zlib.net, it is possible to:

access data randomly in a compressed stream

I am aware of Biopython 1.60, which:

supports reading and writing BGZF files (Blocked GNU Zip Format), a variant of GZIP with efficient random access, most commonly used as part of the BAM file format and in tabix. This uses Python’s zlib library internally, and provides a simple interface like Python’s gzip library.

But for my use case I don't want to use that format. Basically, I want something that emulates the code below:

import gzip
large_integer_new_line_start = 10**9
with gzip.open('large_file.gz','rt') as f:
    f.seek(large_integer_new_line_start)

but with the efficiency offered by native zlib, which can provide random access to the compressed stream. How do I leverage that random access capability in Python?

Recommended answer

I gave up on doing random access on a gzipped file using Python. Instead, I converted my gzipped file to a block-gzipped file with a block compression/decompression utility on the command line:

zcat large_file.gz | bgzip > large_file.bgz
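To illustrate why a blocked layout makes seeking cheap, here is a minimal standard-library sketch. It is not BGZF itself and not what bgzip does internally: it simply compresses the data as independent gzip members and keeps a separate index of where each member starts. The function names write_blocked_gzip and read_at, and the external index, are assumptions made for this illustration.

```python
import bisect
import gzip
import io


def write_blocked_gzip(data: bytes, block_size: int = 64 * 1024):
    """Compress data as independent gzip members.

    Returns (blob, index), where index[i] is the pair
    (compressed_offset, uncompressed_offset) at which block i starts.
    """
    out = io.BytesIO()
    index = []
    for start in range(0, len(data), block_size):
        index.append((out.tell(), start))
        out.write(gzip.compress(data[start:start + block_size]))
    return out.getvalue(), index


def read_at(blob: bytes, index, offset: int, length: int) -> bytes:
    """Return up to length bytes from the given uncompressed offset.

    Only the blocks that overlap the requested range are decompressed.
    """
    if not index:
        return b""
    starts = [uncompressed for _, uncompressed in index]
    # Find the last block whose uncompressed start is <= offset.
    i = bisect.bisect_right(starts, offset) - 1
    skip = offset - starts[i]
    result = bytearray()
    while i < len(index) and len(result) < length:
        c_start = index[i][0]
        c_end = index[i + 1][0] if i + 1 < len(index) else len(blob)
        block = gzip.decompress(blob[c_start:c_end])
        result += block[skip:skip + (length - len(result))]
        skip = 0  # later blocks are read from their beginning
        i += 1
    return bytes(result)
```

Because every member is a complete gzip stream, the concatenated output is still a valid gzip file that ordinary tools can decompress from start to finish. The price is a somewhat worse compression ratio, since each block starts with an empty dictionary.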

Then I used Biopython and its tell method to get the virtual_offset of line number 1 million of the bgzipped file. After that, I was able to seek to that virtual_offset rapidly:

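from Bio import bgzf

file='large_file.bgz'

# First pass: read sequentially up to the line of interest (slow, done once)
handle = bgzf.BgzfReader(file)
for i in range(10**6):
    handle.readline()
# tell() returns a BGZF virtual offset, not a plain byte position
virtual_offset = handle.tell()
line1 = handle.readline()
handle.close()

# Second pass: jump straight to the saved virtual offset (fast)
handle = bgzf.BgzfReader(file)
handle.seek(virtual_offset)
line2 = handle.readline()
handle.close()

# Both reads must return the same line
assert line1==line2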
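The value returned by tell() is not a plain byte count. In the BGZF format a virtual offset packs two numbers into one 64-bit integer: the file position where a compressed block starts (upper 48 bits) and the position inside that block once it is decompressed (lower 16 bits). The sketch below shows that packing in plain Python. The names pack_virtual_offset and unpack_virtual_offset are made up for this illustration; as far as I know, Bio.bgzf ships its own helpers for this, make_virtual_offset and split_virtual_offset.

```python
def pack_virtual_offset(block_start: int, within_block: int) -> int:
    """Combine a compressed block start and an in-block offset.

    block_start must fit in 48 bits and within_block in 16 bits.
    """
    if not 0 <= within_block < 2**16:
        raise ValueError("within_block must fit in 16 bits")
    if not 0 <= block_start < 2**48:
        raise ValueError("block_start must fit in 48 bits")
    return (block_start << 16) | within_block


def unpack_virtual_offset(virtual_offset: int):
    """Split a virtual offset back into (block_start, within_block)."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF
```

Seeking to a virtual offset therefore only requires jumping to block_start in the compressed file, decompressing that one block, and skipping within_block bytes. That is why the second pass above is fast.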

