Multi-part gzip file random access (in Java)

Question

This may fall in the realm of "not really feasible" or "not really worth the effort" but here goes.

I'm trying to randomly access records stored inside a multi-part gzip file. Specifically, the files I'm interested in are compressed Heritrix ARC files. (In case you aren't familiar with multi-part gzip files: the gzip spec allows multiple gzip streams to be concatenated in a single gzip file. The streams do not share any dictionary information; it is simple binary appending.)

I'm thinking it should be possible to do this by seeking to a certain offset within the file, scanning forward for the gzip magic header bytes (i.e. 0x1f 0x8b, as per RFC 1952), and attempting to read a gzip stream starting at those bytes. The problem with this approach is that the same two bytes can also appear inside the compressed data, so searching for them can land on a position that is not a valid start of a gzip stream. Is there a better way to handle random access, given that the record offsets aren't known a priori?

Recommended answer

The design of GZIP, as you have realized, is not friendly to random access.

You can do as you describe, and then if you run into an error in the decompressor, conclude that the signature you found was actually compressed data.
If you finish decompressing, then it's easy to verify the validity of the stream just decompressed, via the CRC32.
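
A minimal sketch of that probe-and-verify idea, assuming the relevant part of the file has already been read into a byte array. The class and method names (`GzipProbe`, `readNextMember`, `tryDecode`, `headerEnd`) are hypothetical, not part of any library. The gzip header is parsed by hand and `java.util.zip.Inflater` is used in raw mode so that exactly one member is decoded and its CRC32 and length trailer can be checked.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.CRC32;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

/** Hypothetical helper: finds and validates one gzip member in an in-memory buffer. */
final class GzipProbe {

    /** Returns the payload of the first valid member at or after 'from', or null if none. */
    static byte[] readNextMember(byte[] buf, int from) {
        for (int i = Math.max(from, 0); i + 1 < buf.length; i++) {
            if (buf[i] == 0x1f && (buf[i + 1] & 0xff) == 0x8b) {   // candidate signature
                byte[] data = tryDecode(buf, i);
                if (data != null) return data;   // otherwise a false positive: keep scanning
            }
        }
        return null;
    }

    /** Decodes a single member starting at 'start'; returns null if anything fails to check out. */
    static byte[] tryDecode(byte[] buf, int start) {
        int pos = headerEnd(buf, start);
        if (pos < 0) return null;
        Inflater inf = new Inflater(true);       // raw deflate; gzip framing is handled here
        try {
            inf.setInput(buf, pos, buf.length - pos);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            while (!inf.finished()) {
                int n = inf.inflate(chunk);
                if (n == 0 && !inf.finished()) return null;        // input ran out: truncated
                out.write(chunk, 0, n);
            }
            int trailer = pos + (int) inf.getBytesRead();          // CRC32 and ISIZE follow the data
            if (trailer + 8 > buf.length) return null;
            byte[] data = out.toByteArray();
            CRC32 crc = new CRC32();
            crc.update(data, 0, data.length);
            boolean ok = crc.getValue() == readUInt32(buf, trailer)
                    && (data.length & 0xffffffffL) == readUInt32(buf, trailer + 4);
            return ok ? data : null;
        } catch (DataFormatException e) {
            return null;                         // the decompressor rejected the bytes
        } finally {
            inf.end();
        }
    }

    /** Returns the offset of the deflate data for a header at 'start', or -1 if malformed. */
    static int headerEnd(byte[] buf, int start) {
        if (start < 0 || start + 10 > buf.length) return -1;
        if (buf[start] != 0x1f || (buf[start + 1] & 0xff) != 0x8b || buf[start + 2] != 8) return -1;
        int flags = buf[start + 3] & 0xff;
        if ((flags & 0xe0) != 0) return -1;      // reserved flag bits must be zero
        int pos = start + 10;                    // magic, CM, FLG, MTIME, XFL, OS
        if ((flags & 0x04) != 0) {               // FEXTRA: 2-byte little-endian length + data
            if (pos + 2 > buf.length) return -1;
            pos += 2 + ((buf[pos] & 0xff) | (buf[pos + 1] & 0xff) << 8);
        }
        for (int bit : new int[] {0x08, 0x10}) { // FNAME, FCOMMENT: zero-terminated strings
            if ((flags & bit) != 0) {
                while (pos < buf.length && buf[pos] != 0) pos++;
                pos++;
            }
        }
        if ((flags & 0x02) != 0) pos += 2;       // FHCRC: header CRC16, not verified in this sketch
        return pos < buf.length ? pos : -1;
    }

    static long readUInt32(byte[] b, int p) {
        return (b[p] & 0xffL) | (b[p + 1] & 0xffL) << 8 | (b[p + 2] & 0xffL) << 16 | (b[p + 3] & 0xffL) << 24;
    }
}
```

Calling `GzipProbe.readNextMember(buf, offset)` returns the first record that decodes cleanly at or after `offset`. A false positive costs one failed decode attempt, which usually fails within a few bytes, though a large member has to be inflated completely before its CRC32 can be compared.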

If the files are not so big, you might consider just decompressing all of the entries in series and retaining the offset of each signature so as to build a directory. As you decompress, dump the bytes to a bit bucket. At that point you will have generated a directory, and you can then support random access based on filename, date, or other metadata.
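
The directory-building pass could look roughly like the following sketch, which assumes the whole file fits in a byte array and consists of nothing but well-formed gzip members. `GzipIndexer`, `memberOffsets`, and `skipHeader` are hypothetical names. `java.util.zip.GZIPInputStream` reads concatenated members as one continuous stream and does not report where each member begins, so the sketch drives `java.util.zip.Inflater` directly and uses `getBytesRead()` to find the end of each member.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

/** Hypothetical helper: walks a concatenated gzip file once and records where each member starts. */
final class GzipIndexer {

    /** Returns the start offset of every gzip member in 'buf', in file order. */
    static List<Integer> memberOffsets(byte[] buf) {
        List<Integer> offsets = new ArrayList<>();
        byte[] sink = new byte[8192];            // the "bit bucket": output is overwritten and ignored
        int pos = 0;
        while (pos < buf.length) {
            int start = pos;
            int dataStart = skipHeader(buf, start);
            Inflater inf = new Inflater(true);   // raw deflate; gzip framing is handled here
            try {
                inf.setInput(buf, dataStart, buf.length - dataStart);
                while (!inf.finished()) {
                    if (inf.inflate(sink) == 0 && !inf.finished()) {
                        throw new IllegalArgumentException("truncated member at offset " + start);
                    }
                }
                // Compressed data is followed by an 8-byte trailer (CRC32 and ISIZE).
                pos = dataStart + (int) inf.getBytesRead() + 8;
            } catch (DataFormatException e) {
                throw new IllegalArgumentException("corrupt member at offset " + start, e);
            } finally {
                inf.end();
            }
            if (pos > buf.length) {
                throw new IllegalArgumentException("missing trailer at offset " + start);
            }
            offsets.add(start);
        }
        return offsets;
    }

    /** Returns the offset of the deflate data for the gzip header at 'start'. */
    static int skipHeader(byte[] buf, int start) {
        if (start + 10 > buf.length || buf[start] != 0x1f
                || (buf[start + 1] & 0xff) != 0x8b || buf[start + 2] != 8) {
            throw new IllegalArgumentException("not a gzip header at offset " + start);
        }
        int flags = buf[start + 3] & 0xff;
        int pos = start + 10;                    // magic, CM, FLG, MTIME, XFL, OS
        if ((flags & 0x04) != 0) {               // FEXTRA: 2-byte little-endian length + data
            if (pos + 2 > buf.length) throw new IllegalArgumentException("truncated header at " + start);
            pos += 2 + ((buf[pos] & 0xff) | (buf[pos + 1] & 0xff) << 8);
        }
        for (int bit : new int[] {0x08, 0x10}) { // FNAME, FCOMMENT: zero-terminated strings
            if ((flags & bit) != 0) {
                while (pos < buf.length && buf[pos] != 0) pos++;
                pos++;
            }
        }
        if ((flags & 0x02) != 0) pos += 2;       // FHCRC: header CRC16, skipped in this sketch
        if (pos >= buf.length) throw new IllegalArgumentException("truncated header at " + start);
        return pos;
    }
}
```

This sketch records offsets only. To look records up by filename or date, the first decompressed chunk of each member would also need to be inspected for whatever metadata the record format carries, which is left out here.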

This will be reasonably fast for files below 100k. Just as a guess, if you had 10 files of around 100k each, it would probably be done in about 2s on a modern CPU. That is what I mean by "reasonably fast". But only you know the perf requirements of your application.

Do you have a GZIPInputStream class (in Java, java.util.zip.GZIPInputStream)? If so, you are halfway there.
