Compression formats with good support for random access within archives?

Question

This is similar to a previous question, but the answers there don't satisfy my needs and my question is slightly different:

I currently use gzip compression for some very large files which contain sorted data. When the files are not compressed, binary search is a handy and efficient way to support seeking to a location in the sorted data.

But when the files are compressed, things get tricky. I recently found out about zlib's Z_FULL_FLUSH option, which can be used during compression to insert "sync points" in the compressed output (inflateSync() can then begin reading from various points in the file). This is OK, though files I already have would have to be recompressed to add this feature (and strangely gzip doesn't have an option for this, but I'm willing to write my own compression program if I must).
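As a rough sketch of the Z_FULL_FLUSH technique (using Python's zlib bindings; the segment sizes and the in-memory offset list below are illustrative choices of mine, not part of the question): a full flush aligns the stream to a byte boundary and discards the compressor's history, so a fresh raw inflater can resume at any such sync point, e.g. the one nearest the middle of the file.

```python
import zlib

# Eight toy "sorted" segments; sizes are arbitrary for illustration.
segments = [("segment %d " % i).encode() * 500 for i in range(8)]

co = zlib.compressobj(wbits=-15)          # raw deflate, no zlib/gzip header
blob, sync_offsets = b"", []
for seg in segments:
    # Z_FULL_FLUSH byte-aligns the output and resets the history window.
    blob += co.compress(seg) + co.flush(zlib.Z_FULL_FLUSH)
    sync_offsets.append(len(blob))        # a fresh inflater can start here
blob += co.flush(zlib.Z_FINISH)

# "Decompress starting roughly 50% of the way into the compressed file":
target = len(blob) // 2
start = min(sync_offsets, key=lambda o: abs(o - target))
d = zlib.decompressobj(wbits=-15)
tail = d.decompress(blob[start:]) + d.flush()
assert b"".join(segments).endswith(tail)  # everything after the sync point
```

Here the sync offsets are recorded at compression time for determinism; in the index-free scenario the question describes, you would instead scan for the `00 00 FF FF` empty-stored-block marker that a full flush emits, with the false-positive caveat discussed below.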

It seems from one source that even Z_FULL_FLUSH is not a perfect solution: not only is it not supported by all gzip archives, but the very idea of detecting sync points in archives may produce false positives (either by coincidence with the magic number for sync points, or because Z_SYNC_FLUSH also produces sync points, which are not usable for random access).

Is there a better solution? I'd like to avoid having auxiliary files for indexing if possible, and explicit, default support for quasi-random access would be helpful (even if it's large-grained--like being able to start reading at each 10 MB interval). Is there another compression format with better support for random reads than gzip?

Edit: As I mentioned, I wish to do binary search in the compressed data. I don't need to seek to a specific (uncompressed) position--only to seek with some coarse granularity within the compressed file. I just want support for something like "Decompress the data starting roughly 50% (25%, 12.5%, etc.) of the way into this compressed file."

Answer

I don't know of any compressed file format which would support random access to a specific location in the uncompressed data (well, except for multimedia formats), but you can brew your own.

For example, bzip2 compressed files are composed of independently compressed blocks (each under 1 MB uncompressed), delimited by magic byte sequences, so you could parse the bzip2 file, find the block boundaries, and decompress just the block you need. This would require some indexing to remember where the blocks start.
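To make the boundary-scanning idea concrete, here is a hedged sketch (Python; the function name is hypothetical) that looks for bzip2's 48-bit block magic at every bit offset, since blocks are bit-aligned rather than byte-aligned; coincidental matches inside compressed data are possible, which is why an index is still advisable.

```python
import bz2

BLOCK_MAGIC = 0x314159265359  # 48-bit magic that starts every bzip2 block
MASK = (1 << 48) - 1

def find_block_magics(data):
    """Return every bit offset at which the block magic occurs.

    bzip2 blocks are bit-aligned, so all 8 bit phases must be checked;
    treating the whole buffer as one big integer keeps the sketch simple.
    """
    bits = int.from_bytes(data, "big")
    total = len(data) * 8
    return [b for b in range(total - 47)
            if (bits >> (total - 48 - b)) & MASK == BLOCK_MAGIC]

comp = bz2.compress(b"hello bzip2 " * 1000)
offsets = find_block_magics(comp)
# The first block magic sits right after the 4-byte "BZh9" stream header.
assert 32 in offsets
```

A real scanner would also have to distinguish true boundaries from chance 48-bit matches in the compressed payload, e.g. by attempting to decode a block at each candidate offset.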

Still, I think the best solution would be to split your file into chunks of your choice, and then bundle the chunks with an archiver such as zip or rar, which supports random access to the individual files in the archive.
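The split-into-chunks approach can be sketched without a full archiver (a minimal illustration in Python; the 64 KiB chunk size and the helper names are my own choices): compress fixed-size chunks independently, keep an offset index, and binary-search that index to decompress only the block covering a given uncompressed position.

```python
import zlib

CHUNK = 1 << 16  # 64 KiB of uncompressed data per block (arbitrary choice)

def compress_chunked(data):
    """Compress data as independent zlib blocks; return blob plus offset index."""
    blob = bytearray()
    index = []  # (compressed_offset, uncompressed_offset) for each block
    for i in range(0, len(data), CHUNK):
        index.append((len(blob), i))
        blob += zlib.compress(data[i:i + CHUNK])
    return bytes(blob), index

def read_at(blob, index, upos):
    """Decompress only the block covering uncompressed position upos."""
    # Binary search for the last block starting at or before upos.
    lo, hi = 0, len(index) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if index[mid][1] <= upos:
            lo = mid
        else:
            hi = mid - 1
    coff = index[lo][0]
    cend = index[lo + 1][0] if lo + 1 < len(index) else len(blob)
    return zlib.decompress(blob[coff:cend])

data = bytes(range(256)) * 1000  # 256,000 bytes of sample data
blob, index = compress_chunked(data)
block = read_at(blob, index, 70000)
assert block == data[65536:65536 + CHUNK]  # the block containing position 70000
```

The index here lives in memory; persisting it means an auxiliary file, which is exactly the trade-off the question hopes to avoid but which zip-style archivers sidestep by storing the equivalent of this index in their central directory.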
