LZ77压缩保留字节“ ,> [英] LZ77 compression reserved bytes "< , >"

查看:179
本文介绍了LZ77压缩保留字节“ ,>的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在学习LZ77压缩,我发现当我找到一个重复的字节串时,我可以使用< distance,length> ,并且保留<,,,>字节。所以...如何压缩有这些字节的文件,如果我不能压缩这些字节,但不能更改不同的字节(因为解码器将无法读取它)。有办法吗?或者解码器只解码是否有一个确切的< d,l> 字符串? (如果有的话,那么想象如果通过coencidence,我们在文件中找到这些字节)会发生什么?)



谢谢!


LZ77是通过解压缩缓冲区中的字符串的长度和距当前位置的距离来引用字符串。但是你是如何编码这些反向引用。许多LZ77的实现以不同的方式做。



但是你是对的,必须有一些方法来区分文字(未压缩的数据片段,从输入到输出)从反向引用(从已经解压缩的部分复制)。



一种方法是保留一些字符作为特殊(所谓的转义序列)。你可以这样做,也就是使用< 来标记反向引用的开始。但是,如果它是一个文本,那么你还需要一种输出< 的方法。例如,可以通过确定在< 之后有另一个< ,那么它意味着字面值,你只需输出一个< 。或者,你可以确定,如果< 之后立即有> 反向引用,所以你只是输出<



这也不是最有效的方法编码这些反向引用,因为它使用几个字节来编码反向引用,因此它将仅在引用长于这几个字节的字符串时才有效。对于较短的反向引用,它将膨胀数据而不是压缩它们,除非你确定匹配短于几个字节被留下,而不是生成反向引用。



如果只压缩纯旧的ASCII文本,则可以采用更好的编码方案,因为ASCII只使用8位中的7位在一个字节。因此,您可以使用最高位来发送反向参考信号,然后使用剩余的7位作为长度,并将下一个字节(或两个)用作反向参考的距离。这样,您可以通过检查其最高位来确定下一个字节是字面ASCII字符还是反向引用。如果它为0,只输出字符。如果它是1,使用以下7位作为长度,并读取接下来的2个字节使用它作为距离。这样,每个反向引用都需要3个字节,因此您可以有效地压缩重复序列长度超过3个字符的文本文件。



但是还有一个更好的方法这可以提供更多的压缩:你可以用可变长度的位代码替换你的字符,这样做的方式使得出现更多的字符将具有最短的代码,而那些罕见的代码将有更长的代码。为了实现这一点,这些代码必须是所谓的前缀代码,使得没有代码将是一些其他代码的前缀。当代码具有此属性时,您可以始终通过按顺序读取这些位来区分它们,直到对其中的一些进行解码。然后你可以确保你不会通过阅读更多的位获得任何其他有效的项目。下一位始终启动另一个新序列。要产生这样的代码,你需要使用哈夫曼树。然后,您可以将所有字节和不同长度的引用连接到一个这样的树中,并根据它们的频率为它们生成不同的位代码。当你试图解码它们,你只是读取位,直到你达到一些这些元素的代码,然后你知道确定它是一个字符的代码或代码的back-reference的长度。在第二种情况下,然后读取一些附加位用于反向引用的距离(也用前缀代码编码)。这是DEFLATE压缩方案。但这是另一个故事,你会在@MarkAdler提供的RFC中找到详细信息。


I'm learning about LZ77 compression, and I saw that when I find a repeated string of bytes, I can use a pointer of the form <distance, length>, and that the "<", ",", ">" bytes are reserved. So... How do I compress a file that has these bytes, if I cannot compress these byte,s but cannot change it by a different byte (because decoders wouldn't be able to read it). Is there a way? Or decoders only decode is there is a exact <d, l> string? (if there is, so imagine if by a coencidence, we find these bytes in a file. What would happen?)

Thanks!

解决方案

LZ77 is about referencing strings back in the decompressing buffer by their lengths and distances from the current position. But it is left to you how do you encode these back-references. Many implementations of LZ77 do it in different ways.

But you are right that there must be some way to distinguish "literals" (uncompressed pieces of data meant to be copied "as is" from the input to the output) from "back-references" (which are copied from already uncompressed portion).

One way to do it is reserving some characters as "special" (so called "escape sequences"). You can do it the way you did it, that is, by using < to mark the start of a back-reference. But then you also need a way to output < if it is a literal. You can do it, for example, by establishing that when after < there's another <, then it means a literal, and you just output one <. Or, you can establish that if after < there's immediately >, with nothing in between, then that's not a back-reference, so you just output <.

It also wouldn't be the most efficient way to encode those back-references, because it uses several bytes to encode a back-reference, so it will become efficient only for referencing strings longer than those several bytes. For shorter back-references it will inflate the data instead of compressing them, unless you establish that matches shorter than several bytes are being left as is, instead of generating back-references. But again, this means lower compression gains.

If you compress only plain old ASCII texts, you can employ a better encoding scheme, because ASCII uses just 7 out of 8 bits in a byte. So you can use the highest bit to signal a back-reference, and then use the remaining 7 bits as length, and the very next byte (or two) as back-reference's distance. This way you can always tell for sure whether the next byte is a literal ASCII character or a back-reference, by checking its highest bit. If it is 0, just output the character as is. If it is 1, use the following 7 bits as length, and read up the next 2 bytes to use it as distance. This way every back-reference takes 3 bytes, so you can efficiently compress text files with repeating sequences of more than 3 characters long.

But there's a still better way to do this, which gives even more compression: you can replace your characters with bit codes of variable lengths, crafted in such a way that the characters appearing more often would have shortest codes, and those which are rare would have longer codes. To achieve that, these codes have to be so-called "prefix codes", so that no code would be a prefix of some other code. When your codes have this property, you can always distinguish them by reading these bits in sequence until you decode some of them. Then you can be sure that you won't get any other valid item by reading more bits. The next bit always starts another new sequence. To produce such codes, you need to use Huffman trees. You can then join all your bytes and different lengths of references into one such tree and generate distinct bit codes for them, depending on their frequency. When you try to decode them, you just read the bits until you reach the code of some of these elements, and then you know for sure whether it is a code of some literal character or a code for back-reference's length. In the second case, you then read some additional bits for the distance of the back-reference (also encoded with a prefix code). This is what DEFLATE compression scheme does. But this is whole another story, and you will find the details in the RFC supplied by @MarkAdler.

这篇关于LZ77压缩保留字节“ ,&gt;的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆