How many times can a file be compressed?


Question


I was thinking about compression, and it seems like there would have to be some sort of limit to the compression that could be applied to a file, otherwise it'd end up as a single byte.

So my question is, how many times can I compress a file before:

  • It does not get any smaller?
  • The file becomes corrupt?

Are these two points the same or different?

Where does the point of diminishing returns appear?

How can these points be found?

I'm not talking about any specific algorithm or particular file, just in general.

Solution

For lossless compression, the only way to know how many times you can gain by recompressing a file is to try it. It's going to depend on the compression algorithm and the file you're compressing.

Two different files can never compress to the same output (the decompressor couldn't tell them apart), so you can't go down to one byte. How could one byte represent all the files you could decompress to?
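The counting argument behind that can be sketched in a few lines. This is only an illustration; the helper names `files_of_length` and `files_shorter_than` are made up for the example.

```python
def files_of_length(n: int) -> int:
    """Number of distinct files that are exactly n bytes long."""
    return 256 ** n


def files_shorter_than(n: int) -> int:
    """Number of distinct files with fewer than n bytes (including the empty file)."""
    return sum(256 ** k for k in range(n))


# A one-byte output can stand for at most 256 different inputs.
print(files_of_length(1))

# There are always fewer short files than files of length n, so no lossless
# compressor can map every n-byte file to a strictly shorter one.
for n in range(1, 6):
    print(n, files_of_length(n), files_shorter_than(n))
```

Since the shorter files always run out first, any compressor that shrinks some inputs has to leave others the same size or make them bigger.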

The reason that the second compression sometimes works is that a compression algorithm can't do omniscient perfect compression. There's a trade-off between the work it has to do and the time it takes to do it. Your file is being changed from all data to a combination of data about your data and the data itself.

Example

Take run-length encoding (probably the simplest useful compression) as an example.

04 04 04 04 43 43 43 43 51 52 10 bytes

That series of bytes could be compressed as:

[4] 04 [4] 43 [-2] 51 52 7 bytes (I'm putting metadata in brackets)

Where the positive number in brackets is a repeat count and the negative number in brackets is a command to emit the next -n characters as they are found.

In this case we could try one more compression:

[3] 04 [-4] 43 fe 51 52 7 bytes (fe is your -2 seen as two's complement data)

We gained nothing, and we'll start growing on the next iteration:

[-7] 03 04 fc 43 fe 51 52 8 bytes

We'll grow by one byte per iteration for a while, but it will actually get worse. One byte can only hold negative numbers down to -128, so a single count byte can cover at most 128 literal bytes. Once the file surpasses 128 bytes in length, we'll start growing by two bytes per iteration, and the growth gets still worse as the file gets bigger.
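The walk-through above can be reproduced with a small encoder. This is a minimal sketch of the scheme as described, not any particular tool's format. The names `rle_encode`, `rle_decode` and `MIN_RUN` are my own, and the choice to treat only runs of 3 or more as repeats is an assumption.

```python
MIN_RUN = 3  # assumption: shorter runs are cheaper to leave as literals


def rle_encode(data: bytes) -> bytes:
    out = bytearray()
    literals = bytearray()

    def flush_literals() -> None:
        # Emit pending literals in chunks of at most 128 bytes, each prefixed
        # by a negative count stored as a two's complement byte.
        for start in range(0, len(literals), 128):
            chunk = literals[start:start + 128]
            out.append(256 - len(chunk))
            out.extend(chunk)
        literals.clear()

    i = 0
    while i < len(data):
        # Measure the run starting at i (a positive count tops out at 127).
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 127:
            run += 1
        if run >= MIN_RUN:
            flush_literals()
            out.append(run)       # positive count: repeat the next byte
            out.append(data[i])
        else:
            literals.extend(data[i:i + run])
        i += run
    flush_literals()
    return bytes(out)


def rle_decode(blob: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(blob):
        control = blob[i]
        i += 1
        if control < 128:         # positive count: repeat the next byte
            out.extend(blob[i:i + 1] * control)
            i += 1
        else:                     # negative count: copy the next n bytes as-is
            n = 256 - control
            out.extend(blob[i:i + n])
            i += n
    return bytes(out)


data = bytes.fromhex("04040404434343435152")
for _ in range(4):
    print(len(data), data.hex(" "))
    data = rle_encode(data)
# Sizes printed: 10, 7, 7, 8
```

Decoding three times gets the original ten bytes back, so nothing is corrupted by the extra passes. The file just stops shrinking and then starts growing.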

There's a headwind blowing against any compression program: the metadata. Real compressors also tack a header onto the beginning of the file. That means that eventually the file will start growing with each additional compression.
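You can watch this happen with a real compressor. The sketch below uses Python's standard `zlib` module as a stand-in. The sample text and the helper name `recompress_sizes` are made up for the example, and the exact sizes will vary with the input and the zlib version.

```python
import zlib


def recompress_sizes(data: bytes, rounds: int) -> list[int]:
    """Compress the data repeatedly and record the size after each round."""
    sizes = [len(data)]
    for _ in range(rounds):
        data = zlib.compress(data, 9)
        sizes.append(len(data))
    return sizes


sample = b"the quick brown fox jumps over the lazy dog. " * 200
sizes = recompress_sizes(sample, 8)
print(sizes)
```

The first round gives a big win. After that the output is close to incompressible, and every further round only adds header and bookkeeping bytes.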


RLE is a starting point. If you want to learn more, look at LZ77 (which looks back into the file to find patterns) and LZ78 (which builds a dictionary). Compressors like zip often try multiple algorithms and use the best one.

Here are some cases I can think of where multiple compression has worked.

  1. I worked at an Amiga magazine that shipped with a disk. Naturally, we packed the disk to the gills. One of the tools we used let you pack an executable so that when it was run, it decompressed and ran itself. Because the decompression algorithm had to be in every executable, it had to be small and simple. We often got extra gains by compressing twice. The decompression was done in RAM. Since reading a floppy was slow, we often got a speed increase as well!
  2. Microsoft supported RLE compression on bmp files. Also, many word processors did RLE encoding. RLE files are almost always significantly compressible by a better compressor.
  3. A lot of the games I worked on used a small, fast LZ77 decompressor. If you compress a large rectangle of pixels (especially if it has a lot of background color, or if it's an animation), you can very often compress twice with good results. (The reason? You only have so many bits to specify the lookback distance and the length, so a single large repeated pattern is encoded in several pieces, and those pieces are highly compressible.)
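The third case is easy to reproduce with `zlib`, whose DEFLATE format is LZ77-based and caps a single match at 258 bytes. This is only a sketch: the all-zero buffer stands in for a large area of background color, and zlib is not the decompressor those games used.

```python
import zlib

# Assumption: 5 MB of zero bytes plays the role of a big single-color image.
pixels = bytes(5_000_000)

once = zlib.compress(pixels, 9)
twice = zlib.compress(once, 9)

print(len(pixels), len(once), len(twice))

# Decompressing twice restores the original data exactly.
assert zlib.decompress(zlib.decompress(twice)) == pixels
```

The first pass has to spell out the long run as thousands of capped matches, which produces a very regular output. The second pass then finds that regularity and squeezes it again.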
