How to compute a good preset dictionary for deflate compression


Question

I have an opportunity to use a preset dictionary for deflate compression. It makes sense in my case, because the data to be compressed is relatively small (1 KB to 3 KB) and I have a large sample of representative examples. The data consists of arbitrary byte sequences, so tokenization and similar approaches are not a good way to go. The data also shows a lot of repetition across examples, so a good dictionary could potentially give very good results.

The question is: how do I compute a good dictionary? Is there an algorithm that computes an optimal dictionary from the sample data?

I started looking at prefix trees, but it is not clear how to use them in this context.

Best regards, Jarek

Answer

I am not aware of an algorithm to generate an optimal or even a good dictionary. This is generally done by hand. I think that a suffix tree would be a good approach to finding common strings for a dictionary, but I have never tried it.
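As a rough starting point for finding common strings, here is a minimal Python sketch. It does not build a suffix tree; it swaps in a much simpler technique, counting fixed-length substrings by how many samples contain them. The function name `common_substrings` and the parameters `length` and `min_samples` are made up for illustration, and the fixed substring length is an assumption you would need to tune.

```python
from collections import Counter


def common_substrings(samples, length=8, min_samples=2):
    """Count how many samples contain each `length`-byte substring.

    Returns (substring, sample_count) pairs, most widespread first.
    This is plain n-gram counting, not a suffix tree.
    """
    doc_freq = Counter()
    for sample in samples:
        # Use a set so each substring counts at most once per sample.
        seen = {sample[i:i + length] for i in range(len(sample) - length + 1)}
        doc_freq.update(seen)
    # Keep only substrings shared by at least `min_samples` samples.
    return [(s, n) for s, n in doc_freq.most_common() if n >= min_samples]
```

Counting per sample rather than per occurrence favors strings that recur across examples, which is the kind of repetition a preset dictionary can exploit. A suffix tree or suffix array would find repeats of any length, at the cost of more code.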

The first thing to try is to simply concatenate 32K worth of your 1-3K examples (32K being the size of the deflate window) and see how much gain that provides over no dictionary. Then you experiment from there, changing the order of the examples or pulling the pieces that repeat across examples out to the end of the dictionary.
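That first experiment can be sketched with Python's standard `zlib` module, which accepts a preset dictionary through the `zdict` argument of `zlib.compressobj` and `zlib.decompressobj`. The helper names below (`build_concat_dict`, `deflate`, `inflate`) are hypothetical, and the sketch assumes your samples are already loaded as a list of `bytes` objects.

```python
import zlib

WINDOW = 32 * 1024  # deflate's window size; a longer dictionary cannot be fully used


def build_concat_dict(samples, limit=WINDOW):
    """Concatenate samples and keep at most the last `limit` bytes."""
    return b"".join(samples)[-limit:]


def deflate(data, zdict=None):
    """Compress `data` as a zlib stream, optionally with a preset dictionary."""
    if zdict:
        comp = zlib.compressobj(level=9, zdict=zdict)
    else:
        comp = zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()


def inflate(blob, zdict=None):
    """Decompress a zlib stream; the same dictionary must be supplied again."""
    if zdict:
        decomp = zlib.decompressobj(zdict=zdict)
    else:
        decomp = zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()
```

To measure the gain, build the dictionary from one part of the sample set and compare `len(deflate(s, zdict))` against `len(deflate(s))` on examples that were not used to build it. Evaluating on held-out examples keeps the comparison honest. The decompressor needs exactly the same dictionary bytes, so the dictionary has to be distributed to both sides.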

Note that the most common strings should be put at the end, since shorter distances take fewer bits.
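One way to apply this ordering, as a hedged Python sketch: given candidate strings paired with a count of how often each occurs, fill the dictionary starting from the most common strings, then reverse the result so the most common one sits at the very end. The function name `order_for_dictionary` and the `(bytes, count)` input shape are assumptions made for illustration.

```python
def order_for_dictionary(counted, limit=32 * 1024):
    """Build a dictionary from (bytes, count) pairs, most common strings last."""
    # Walk from most to least common, so the size limit drops rare strings first.
    by_count = sorted(counted, key=lambda pair: pair[1], reverse=True)
    chosen, used = [], 0
    for chunk, _count in by_count:
        if used + len(chunk) > limit:
            continue  # does not fit; a shorter, rarer chunk may still fit
        chosen.append(chunk)
        used += len(chunk)
    # Reverse so the most common string ends up at the end of the dictionary,
    # where it is reachable with the shortest distances.
    return b"".join(reversed(chosen))
```

This sketch does not merge overlapping candidates, so strings that share bytes take up more room than necessary. Whether a given ordering actually helps is best checked by measuring compressed sizes on held-out examples.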
