Reusing compression dictionary

Problem description

Is there a compression tool that will let you output its dictionary (or similar) separate from the compressed output such that the dictionary can be re-used on a subsequent compression? The idea would be to transfer the dictionary one time, or use a reference dictionary at a remote site, and make a compressed file even smaller to transfer.

I've looked at the docs for a bunch of common compression tools, and I can't really find one that supports this. But most common compression tools aren't straight dictionary compression.

The usage I imagine is something like:

compress_tool --dictionary compressed.dict -o compressed.data uncompressed
decompress_tool --dictionary compressed.dict -o uncompressed compressed.data

To expand on my use case, I have a binary 500MB file F I want to copy over a slow network. Compressing the file alone yields a size of 200MB, which is still larger than I'd like. However, both my source and destination have a file F' which is very similar to F, but sufficiently different that binary diff tools don't work well. I was thinking that if I compress F' on both sites and then re-use information about that compression to compress F on the source, I could possibly eliminate some information from the transfer that could be rebuilt on the destination using F'.
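The workflow imagined above can be sketched with Python's zlib bindings, which accept a preset dictionary via the `zdict` parameter. This is only an illustration of the idea, not a tool recommendation; the function names are invented, and the 32 KiB cap reflects zlib's window size (it never looks further back than that):

```python
import zlib

# Sketch only: treat a shared reference file (F') as a preset dictionary.
# zlib's back-reference window is 32 KiB, so only the last 32 KiB of the
# dictionary can ever be matched against.

def compress_with_reference(data: bytes, reference: bytes) -> bytes:
    c = zlib.compressobj(level=9, zdict=reference[-32768:])
    return c.compress(data) + c.flush()

def decompress_with_reference(blob: bytes, reference: bytes) -> bytes:
    d = zlib.decompressobj(zdict=reference[-32768:])
    return d.decompress(blob) + d.flush()

# Both sides already hold the reference; only the blob crosses the network.
```

Note that this only lets the compressor reference the final 32 KiB of F', which is exactly why the answer below says preset dictionaries don't help much at the 500 MB scale.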

Recommended answer

Preset dictionaries aren't really useful for files that size. They're great for small data (think compressing fields in a database, RPC queries/responses, snippets of XML or JSON, etc.), but for larger files like yours the algorithm builds up its own dictionary very quickly.
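The small-data effect is easy to observe with Python's zlib bindings, which expose preset dictionaries through `zdict`. The record contents below are invented for illustration; the point is that on short inputs the dictionary pays for itself, while on a large file it would be noise:

```python
import zlib

# Invented example: short, structurally similar JSON records.
records = [b'{"user": "alice", "action": "login"}',
           b'{"user": "bob", "action": "login"}']

# Preset dictionary built from one earlier record of the same shape.
zdict = b'{"user": "carol", "action": "login"}'

for rec in records:
    c = zlib.compressobj(level=9, zdict=zdict)
    with_dict = c.compress(rec) + c.flush()
    without_dict = zlib.compress(rec, 9)
    # The shared structure is replaced by back-references into the
    # dictionary, so the preset-dictionary output is smaller.
    assert len(with_dict) < len(without_dict)
    d = zlib.decompressobj(zdict=zdict)
    assert d.decompress(with_dict) + d.flush() == rec
```

The same dictionary must, of course, be supplied on the decompression side, which is the "transfer the dictionary one time" part of the question.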

That said, it just so happens that I was playing with preset dictionaries in Squash fairly recently, and I do have some code which does pretty much what you're talking about for the zlib plugin. I'm not going to push it to master (I have a different API in mind if I decide to support preset dictionaries), but I've just pushed it to the 'deflate-dictionary-file' branch if you want to take a look. To compress, do something like

squash -ko dictionary-file=foo.dict -c zlib:deflate uncompressed compressed.deflate

To decompress,

squash -dko dictionary-file=foo.dict -c zlib:deflate compressed.deflate decompressed

AFAIK there is nothing in zlib which supports building a dictionary--you have to do that yourself. The zlib documentation describes the "format":

The dictionary should consist of strings (byte sequences) that are likely to be encountered later in the data to be compressed, with the most commonly used strings preferably put towards the end of the dictionary. Using a dictionary is most useful when the data to be compressed is short and can be predicted with good accuracy; the data can then be compressed better than with the default empty dictionary.

For testing I was using something like this (YMMV):

cat input | tr ' ' '\n' | sort | uniq -c | awk '{printf "%06d %s\n",$1,$2}' | sort | cut -b8- | tail -c32768
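A rough Python equivalent of that shell pipeline, for anyone who prefers to build the dictionary in code. The whitespace tokenisation mirrors the `tr ' ' '\n'` step, and the 32 KiB cap matches zlib's window size; the function name is made up:

```python
from collections import Counter

def build_zdict(data: bytes, max_size: int = 32768) -> bytes:
    # Count whitespace-separated tokens, as the shell pipeline does.
    counts = Counter(data.split())
    # zlib wants the most common strings towards the END of the
    # dictionary, so order tokens by ascending frequency.
    ordered = sorted(counts, key=counts.__getitem__)
    # Keep only the final max_size bytes -- zlib's 32 KiB window.
    return b' '.join(ordered)[-max_size:]
```

The result can be passed straight to `zlib.compressobj(zdict=...)` on the sending side and `zlib.decompressobj(zdict=...)` on the receiving side.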
