Why does base64-encoded data compress so poorly?


Question

I was recently compressing some files, and I noticed that base64-encoded data seems to compress really badly. Here is one example:

  • Original file: 429,7 MiB
  • compress via xz -9:
    13,2 MiB / 429,7 MiB = 0,031 4,9 MiB/s 1:28
  • base64 it and compress via xz -9:
    26,7 MiB / 580,4 MiB = 0,046 2,6 MiB/s 3:47
  • base64 the original compressed xz file:
    17,8 MiB in almost no time = the expected 1.33x increase in size

So it can be observed that:

  • xz compresses really well ☺
  • base64-encoded data doesn't compress well: the result is 2 times larger than the compressed unencoded file
  • base64-then-compress is significantly worse and slower than compress-then-base64

How can this be? Base64 is a lossless, reversible algorithm, so why does it affect compression so much? (I tried gzip as well, with similar results.)

I know it doesn't make sense to base64-then-compress a file, but most of the time one doesn't have control over the input files. I would have thought that the actual information density (or whatever it is called) of a base64-encoded file would be nearly identical to that of the non-encoded version, and that it would thus be similarly compressible.

Recommended answer

Most generic compression algorithms work with a one-byte granularity.

Let's consider the following string:

"XXXXYYYYXXXXYYYY"

  • A Run-Length-Encoding algorithm will say: "That's 4 'X', followed by 4 'Y', followed by 4 'X', followed by 4 'Y'."
  • A Lempel-Ziv algorithm will say: "That's the string 'XXXXYYYY', followed by the same string, so let's replace the 2nd string with a reference to the 1st one."
  • A Huffman coding algorithm will say: "There are only 2 symbols in that string, so I can use just one bit per symbol."
Now let's encode our string in Base64. Here's what we get:

      "WFhYWFlZWVlYWFhYWVlZWQ=="
      

All algorithms are now saying: "What kind of mess is that?" And they're not likely to compress that string very well.
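
To make that "mess" a bit more concrete, here is a small Python sketch that counts what a run-length encoder and a Huffman coder would have to deal with, before and after encoding. The helper names `count_runs` and `count_symbols` are my own, made up for this illustration; only `base64.b64encode` and `itertools.groupby` come from the standard library.

```python
import base64
from itertools import groupby

original = b"XXXXYYYYXXXXYYYY"
encoded = base64.b64encode(original)


def count_runs(data: bytes) -> int:
    """Number of maximal runs of identical bytes (what RLE would emit)."""
    return sum(1 for _ in groupby(data))


def count_symbols(data: bytes) -> int:
    """Number of distinct byte values (the alphabet a Huffman coder must cover)."""
    return len(set(data))


for label, data in (("original", original), ("base64", encoded)):
    print(f"{label:8} {data.decode()!r}: "
          f"runs={count_runs(data)}, symbols={count_symbols(data)}")
```

The original has 4 runs over 2 symbols; the encoded string has 23 runs over 9 symbols, so both the run structure and the tiny alphabet are gone.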

As a reminder, Base64 basically works by re-encoding each group of 3 bytes (values 0...255) as a group of 4 bytes (values 0...63):

      Input bytes    : aaaaaaaa bbbbbbbb cccccccc
      6-bit repacking: 00aaaaaa 00aabbbb 00bbbbcc 00cccccc
      

Each output byte is then transformed into a printable ASCII character. By convention, these characters are (here with a mark every 10 characters):

      0         1         2         3         4         5         6
      ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
      

For instance, our example string begins with a group of three bytes, each equal to 0x58 in hexadecimal (the ASCII code of the character "X"), or 01011000 in binary. Let's apply Base64 encoding:

      Input bytes      : 0x58     0x58     0x58
      As binary        : 01011000 01011000 01011000
      6-bit repacking  : 00010110 00000101 00100001 00011000
      As decimal       : 22       5        33       24
      Base64 characters: 'W'      'F'      'h'      'Y'
      Output bytes     : 0x57     0x46     0x68     0x59
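
The same steps can be reproduced in a few lines of Python. This is only a sketch of the repacking for one full 3-byte group (no padding handling); `repack` and `ALPHABET` are names I made up for the illustration, and the result is checked against the standard `base64` module.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def repack(group: bytes) -> list[int]:
    """Split exactly 3 bytes (24 bits) into four 6-bit values."""
    if len(group) != 3:
        raise ValueError("this sketch only handles full 3-byte groups")
    bits = int.from_bytes(group, "big")
    # Take the 24 bits six at a time, most significant first.
    return [(bits >> shift) & 0b111111 for shift in (18, 12, 6, 0)]


values = repack(b"XXX")
chars = "".join(ALPHABET[v] for v in values)
print(values, chars)

# Sanity check against the standard library encoder.
assert chars == base64.b64encode(b"XXX").decode()
```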
      

Basically, the pattern "3 times the byte 0x58", which was obvious in the original data stream, is not obvious anymore in the encoded data stream, because we've broken the bytes into 6-bit packets and mapped them to new bytes that now appear to be random.
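
A related effect is easy to see with a short Python sketch: the same byte sequence encodes to different characters depending on where it falls relative to the 3-byte group boundaries. The word `abc` and the `-` filler bytes are arbitrary choices for the illustration.

```python
import base64

word = b"abc"

# Shift the word by 0, 1 and 2 filler bytes and encode each variant.
encodings = [base64.b64encode(b"-" * pad + word).decode() for pad in range(3)]

for pad, text in enumerate(encodings):
    print(f"offset {pad}: {text}")
```

A repeated word in the input therefore shows up in up to three different forms in the output, which makes it harder for a dictionary-based compressor to spot the repetition.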

In other words: we have broken the original byte alignment that most compression algorithms rely on.

Whatever compression method is used, this usually has a severe impact on how well the algorithm performs. That's why you should always compress first and encode second.
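
Here is a rough way to compare the two orders in Python with `zlib` (the DEFLATE algorithm that gzip also uses). The sample data is something I generated for the illustration, not the file from the question, so the exact ratios will differ on real inputs.

```python
import base64
import zlib

# Synthetic, text-like sample data (an assumption made for this sketch).
raw = b"".join(b"record %d: value=%d\n" % (i, i * i) for i in range(5000))

compressed = zlib.compress(raw, 9)

# Order 1: compress first, encode second.
compress_then_encode = base64.b64encode(compressed)

# Order 2: encode first, compress second.
encode_then_compress = zlib.compress(base64.b64encode(raw), 9)

print("raw                 :", len(raw))
print("compressed          :", len(compressed))
print("compress-then-base64:", len(compress_then_encode))
print("base64-then-compress:", len(encode_then_compress))
```

Compress-then-base64 always costs exactly the 4/3 encoding overhead on top of the compressed size, whereas the cost of base64-then-compress depends on how badly the encoding scrambles the patterns in the data.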

This is even more true for encryption: compress first, encrypt second.

Edit: a note about LZMA

As MSalters noticed, LZMA (which xz uses) works on bit streams rather than byte streams.

Still, this algorithm will also suffer from Base64 encoding in a way which is essentially consistent with my earlier description:

      Input bytes      : 0x58     0x58     0x58
      As binary        : 01011000 01011000 01011000
      (see above for the details of Base64 encoding)
      Output bytes     : 0x57     0x46     0x68     0x59
      As binary        : 01010111 01000110 01101000 01011001
      

Even when working at the bit level, it's much easier to recognize a pattern in the input binary sequence than in the output binary sequence.
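
For what it's worth, this can be checked with Python's `lzma` module, which produces xz-format output by default. Again, the input is synthetic data I made up for the sketch, so treat the numbers as indicative only.

```python
import base64
import lzma

# Synthetic, text-like sample data (an assumption made for this sketch).
raw = b"".join(b"record %d: value=%d\n" % (i, i * i) for i in range(5000))
encoded = base64.b64encode(raw)

raw_xz = lzma.compress(raw, preset=9)
encoded_xz = lzma.compress(encoded, preset=9)

print(f"raw    : {len(raw):7d} -> {len(raw_xz):6d} bytes")
print(f"base64 : {len(encoded):7d} -> {len(encoded_xz):6d} bytes")
```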
