文本压缩期间存储概率表 [英] Storing Probability table during text compression

查看:76
本文介绍了文本压缩期间存储概率表的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在做一个项目,比较静态和自适应形式的不同类型的文本压缩方法,例如Huffman和Arithmetic.我使用文本中每个字母的出现次数为两者创建一个概率表.现在,对于自适应形式,接收器不需要概率表,但是对于静态形式,我们也需要将该概率表也发送给接收器以对消息进行解码.现在,此表的存储将需要一些额外的位,在比较时应将其考虑在内.

I am doing a project where I compare different types of Text compression methods such as Huffman and Arithmetic for both static and adaptive form. I make a probability table for both using the number of occurrence of each letter in the text. Now, for adaptive form, the receiver does not need the Probability table but for the static form, we need to transmit this probability table as well to the receiver for decoding the message. Now this storing of the table will need some extra bits, which should be taken into account while comparing.

所以我的问题是:

  1. 将概率表(存储在文件中)的最佳解决方案是什么.
  2. 执行此操作所需的最小位数是多少? (我知道这取决于文本,但是有某种方法可以找到存储表所需的最小位).

非常感谢您.

推荐答案

从概率中,将代码长度分配给符号.要创建代码,接收方需要一个元组列表:(位计数,符号计数),然后是要分配给该代码的符号.现在,您可以尝试如何编码它们了.

From the probabilities, you assign code lengths to symbols. To create the code the receiver needs a list of tuples: (bit count, symbol count), followed by the symbols in the order to be allocated to the code. Now you can play around with how you encode those.

对符号列表进行编码可以使用以下事实:对于每个传输的符号,跟随符号所需的位数都会减少.尽早指定使用(例如)8位符号的某些子集的选项可能会有所帮助.随着代码字变得越来越长,对一系列符号进行编码可能会比较方便,而不是传输每个符号-也许可以用一种方式来表达较少的符号,其中空洞"可以表示为一些位数取决于游程的长度-或起始符号,长度和位向量(请注意,表示长度的位数取决于起始符号和剩余符号的数量,无需发送范围内的第一个和最后一个!)

Encoding the list of symbols can use the fact that for every symbol transmitted, the number of bits you need for following symbols goes down. An option to specify early on that some subset of (say) 8-bit symbols is used can help here. As the code words get longer, it may be handy to have an encoding for a run of symbols, rather than transmitting each one -- perhaps with a way to express a run less a few symbols, where the "holes" can be expressed in some number of bits which depends on the length of the run -- or a start symbol, length and bit-vector (noting that the number of bits to express the length depends on the start symbol and the number of symbols left, and there is no need to send a bit for the first and last in the range !)

霍夫曼代码表的编码本身就是一个完整的游戏.然后,对于短消息来说,该表可能会带来严重的开销……在这种情况下,(少量)常用表可能会带来更好的压缩效果.

The encoding of the Huffman code table is a whole game in itself. Then for short messages, the table can be a serious overhead... in which case, a (small) number of commonly useful tables may give a better compression.

对于每个符号的代码长度,您还可以使用霍夫曼编码,然后按符号顺序发送.重复计数机制及其霍夫曼(Huffman)可以为您提供帮助,以及一种跳过未使用符号(即代码长度为零的符号)的方式.当然,您可以添加一个第一级表来为此指定编码!

You can also mess about with a Huffman encoding for the code length of each symbol, and send those in symbol order. A repeat count mechanism, with its Huffman can help here, and a way of skipping runs of unused symbols (ie symbols with zero code lengths). You can, of course, add a first level table to specify the encoding for this !

另一种方法是多个位向量,每个码字长度一个向量.从具有最多符号的代码字长度开始,发出长度和位向量,然后从下一个人口最多的代码长度中以较小的位向量开始……依此类推.同样,一种编码游程和范围的方法可以减少所需的位数,并且在继续操作时,这些位数也将减少.

Another approach is a number of bit vectors, one vector for each code word length. Starting with the code word length with the most symbols, emit the length and a bit vector, then the next most populous code length with a smaller bit vector... and so on. Again, a way to encode runs and ranges can cut down the number of bits required, and again, as you proceed, the bits required for those goes down.

问题是,与代码表大小的比较有多敏感?显然,如果它非常敏感,那么研究应用狡猾可以做什么是很重要的.但是任何给定方案的有效性都将取决于它是否适合被压缩的典型"数据.

The question is, how sensitive is the comparison to the size of the code table ? Clearly, if it is very sensitive, then investigating what can be done by the application of cunning is important. But the effectiveness of any given scheme is going to depend on how well it fits "typical" data being compressed.

这篇关于文本压缩期间存储概率表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆