Compression formats with good support for random access within archives?


Question

This is similar to a previous question, but the answers there don't satisfy my needs and my question is slightly different:

I currently use gzip compression for some very large files which contain sorted data. When the files are not compressed, binary search is a handy and efficient way to support seeking to a location in the sorted data.

But when the files are compressed, things get tricky. I recently found out about zlib's Z_FULL_FLUSH option, which can be used during compression to insert "sync points" in the compressed output (inflateSync() can then begin reading from various points in the file). This is OK, though files I already have would have to be recompressed to add this feature (and strangely gzip doesn't have an option for this, but I'm willing to write my own compression program if I must).
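The sync-point idea can be sketched with Python's `zlib` bindings. This is a minimal illustration, not the asker's actual pipeline: it uses raw deflate (`wbits=-15`, no gzip header) so that decompression can be restarted cold at any byte offset recorded right after a `Z_FULL_FLUSH`, which resets the compressor's history window.

```python
import zlib

# Compress three chunks, issuing a full flush after each one.  Z_FULL_FLUSH
# resets the deflate history, so a fresh decompressor can start reading at
# the byte offset recorded immediately after each flush.
chunks = [b"alpha" * 1000, b"bravo" * 1000, b"charlie" * 1000]
comp = zlib.compressobj(9, zlib.DEFLATED, -15)  # raw deflate, no header
out = bytearray()
sync_points = []                   # byte offsets where reads may begin
for chunk in chunks:
    out += comp.compress(chunk)
    out += comp.flush(zlib.Z_FULL_FLUSH)
    sync_points.append(len(out))

# Start reading at the second sync point, skipping the first chunk entirely.
restored = zlib.decompressobj(-15).decompress(bytes(out[sync_points[0]:]))
assert restored == chunks[1] + chunks[2]
```

Note the cost of this scheme: each full flush empties the history window, so frequent sync points hurt the compression ratio.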

It seems from one source that even Z_FULL_FLUSH is not a perfect solution: not only is it not supported by all gzip archives, but the very idea of detecting sync points in archives may produce false positives (either by coincidence with the magic number for sync points, or because Z_SYNC_FLUSH also produces sync points, but they are not usable for random access).

Is there a better solution? I'd like to avoid having auxiliary files for indexing if possible, and explicit, default support for quasi-random access would be helpful (even if it's large-grained--like being able to start reading at each 10 MB interval). Is there another compression format with better support for random reads than gzip?

Edit: As I mentioned, I wish to do binary search in the compressed data. I don't need to seek to a specific (uncompressed) position--only to seek with some coarse granularity within the compressed file. I just want support for something like "Decompress the data starting roughly 50% (25%, 12.5%, etc.) of the way into this compressed file."
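The kind of coarse binary search described above can be sketched against any format with independently decompressible chunks. The toy below (names and chunk size are invented for illustration) compresses sorted records in fixed-size groups with zlib and then bisects on the first key of each chunk, so only O(log n) chunks are ever decompressed:

```python
import zlib

def make_chunked(records, chunk_size=4):
    """Compress sorted records in fixed-size groups (stand-in for sync points)."""
    chunks = []
    for i in range(0, len(records), chunk_size):
        payload = b"\n".join(records[i:i + chunk_size])
        chunks.append(zlib.compress(payload))
    return chunks

def search(chunks, key):
    """Find the rightmost chunk whose first record is <= key, then scan it."""
    lo, hi = 0, len(chunks) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        first = zlib.decompress(chunks[mid]).split(b"\n", 1)[0]
        if first <= key:
            lo = mid          # key can only be in this chunk or later
        else:
            hi = mid - 1      # key is strictly before this chunk
    return key in zlib.decompress(chunks[lo]).split(b"\n")

records = sorted(b"%06d" % (i * 7) for i in range(100))
chunks = make_chunked(records)
```

Seeking to "roughly 50% of the way into the file" corresponds to picking the middle chunk here; with sync points in a single stream the chunk list would be replaced by a list of byte offsets.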

Answer

I don't know of any compressed file format which would support random access to a specific location in the uncompressed data (well, except for multimedia formats), but you can brew your own.

For example, bzip2 compressed files are composed of independent compressed blocks of size <1MB uncompressed, which are delimited by sequences of magic bytes, so you could parse the bzip2 file, get the block boundaries and then just uncompress the right block. Note that the block boundaries are bit-aligned, not byte-aligned. This would need some indexing to remember where the blocks start.
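Building such an index can be sketched by scanning for the 48-bit block-header magic (0x314159265359). The function name below is made up for illustration, and because blocks are bit-aligned every bit offset must be checked; a match could in principle be a false positive inside compressed data, so a real indexer should verify each candidate by attempting to decode the block.

```python
import bz2
import random

MAGIC = 0x314159265359  # bzip2 block-header magic (digits of pi in BCD)

def find_block_bit_offsets(stream: bytes):
    """Return bit offsets of candidate bzip2 block headers in *stream*."""
    offsets = []
    for i in range(len(stream) - 6):
        # 56-bit window starting at byte i; test all 8 bit alignments
        window = int.from_bytes(stream[i:i + 7], "big")
        for bit in range(8):
            if (window >> (8 - bit)) & 0xFFFFFFFFFFFF == MAGIC:
                offsets.append(i * 8 + bit)
    return offsets

random.seed(0)
data = random.randbytes(250_000)          # incompressible, forces several blocks
blob = bz2.compress(data, 1)              # compresslevel=1 -> ~100 kB blocks
offsets = find_block_bit_offsets(blob)
# the first block header sits right after the 4-byte "BZh1" file header
```

With level-1 compression each block holds roughly 100 kB of input, so 250 kB of random data yields multiple block headers to index.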

Still, I think the best solution would be to split your file into chunks of your choice, and then compress it with some archiver, like zip or rar, which supports random access to individual files in the archive.
