Why does base64-encoded data compress so poorly?

Views: 1641 | Published: 2020/6/29 21:10:34 | Tags: compression, lossless-compression

Question

I was recently compressing some files, and I noticed that base64-encoded data seems to compress really badly. Here is one example:

Original file: 429,7 MiB
compress via xz -9:
13,2 MiB / 429,7 MiB = 0,031 4,9 MiB/s 1:28
base64 it and compress via xz -9:
26,7 MiB / 580,4 MiB = 0,046 2,6 MiB/s 3:47
base64 the original compressed xz file:
17,8 MiB in almost no time = the expected 1.33x increase in size

So it can be observed that:

xz compresses really well ☺
base64-encoded data doesn't compress well; it is 2 times larger than the unencoded compressed file
base64-then-compress is significantly worse and slower than compress-then-base64

How can this be? Base64 is a lossless, reversible algorithm, so why does it affect compression so much? (I tried with gzip as well, with similar results.)

I know it doesn't make sense to base64-then-compress a file, but most of the time one doesn't have control over the input files. I would have thought that the actual information density (or whatever it is called) of a base64-encoded file would be nearly identical to that of the non-encoded version, and that it would thus be similarly compressible.

Recommended answer

Most generic compression algorithms work with a one-byte granularity.

Let's consider the following string:

"XXXXYYYYXXXXYYYY"

A Run-Length-Encoding algorithm will say: "that's 4 'X', followed by 4 'Y', followed by 4 'X', followed by 4 'Y'"
A Lempel-Ziv algorithm will say: "That's the string 'XXXXYYYY', followed by the same string: so let's replace the 2nd string with a reference to the 1st one."
A Huffman coding algorithm will say: "There are only 2 symbols in that string, so I can use just one bit per symbol."
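
As a rough illustration of the first and third observations, here is a minimal Python sketch. The helper name `run_length_encode` is hypothetical and only models the idea of run-length encoding; it is not taken from any real compressor.

```python
from itertools import groupby

def run_length_encode(text: str) -> list[tuple[str, int]]:
    """Collapse each run of identical characters into a (character, count) pair."""
    return [(char, len(list(run))) for char, run in groupby(text)]

sample = "XXXXYYYYXXXXYYYY"
runs = run_length_encode(sample)
distinct_symbols = sorted(set(sample))

print(runs)              # the four runs described above
print(distinct_symbols)  # only two symbols, so one bit per symbol would be enough
```

Both views rely on looking at the data one byte (one character) at a time.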

Now let's encode our string in Base64. Here's what we get:

"WFhYWFlZWVlYWFhYWVlZWQ=="

All algorithms are now saying: "What kind of mess is that?". And they're not likely to compress that string very well.
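
One way to try this out is the Python sketch below. It encodes the example string with the standard `base64` module, then compares how `zlib` handles some repetitive data before and after Base64 encoding. The vocabulary and the `make_text` helper are made up for this illustration; the exact sizes will depend on the input and on the compressor.

```python
import base64
import random
import zlib

# The example string from the text, encoded with the standard library.
encoded_sample = base64.b64encode(b"XXXXYYYYXXXXYYYY")

def make_text(word_count: int, seed: int = 0) -> bytes:
    """Build repetitive, text-like data from a small hypothetical vocabulary."""
    rng = random.Random(seed)
    vocabulary = [b"alpha", b"be", b"gamma", b"delta", b"pi", b"omega", b"tau", b"sigma"]
    return b" ".join(rng.choice(vocabulary) for _ in range(word_count))

raw = make_text(20_000)
encoded = base64.b64encode(raw)

raw_compressed = len(zlib.compress(raw, 9))
encoded_compressed = len(zlib.compress(encoded, 9))

print(encoded_sample.decode("ascii"))
print(f"raw:    {len(raw)} -> {raw_compressed} bytes")
print(f"base64: {len(encoded)} -> {encoded_compressed} bytes")
```

The words have different lengths on purpose, so that repeated words land at varying positions within the 3-byte groups that Base64 works on.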

As a reminder, Base64 basically works by re-encoding groups of 3 bytes in (0...255) into groups of 4 bytes in (0...63):

Input bytes    : aaaaaaaa bbbbbbbb cccccccc
6-bit repacking: 00aaaaaa 00aabbbb 00bbbbcc 00cccccc

Each output byte is then transformed into a printable ASCII character. By convention, these characters are (here with a mark every 10 characters):

0         1         2         3         4         5         6
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/

For instance, our example string begins with a group of three bytes equal to 0x58 in hexadecimal (ASCII code of character "X"). Or in binary: 01011000. Let's apply Base64 encoding:

Input bytes      : 0x58     0x58     0x58
As binary        : 01011000 01011000 01011000
6-bit repacking  : 00010110 00000101 00100001 00011000
As decimal       : 22       5        33       24
Base64 characters: 'W'      'F'      'h'      'Y'
Output bytes     : 0x57     0x46     0x68     0x59
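
The repacking above can be sketched in a few lines of Python. The function `encode_group` is a hypothetical helper written for this walk-through; it only handles full 3-byte groups and ignores padding, so it is not a replacement for the standard `base64` module.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def encode_group(group: bytes) -> str:
    """Encode exactly three bytes into four Base64 characters."""
    if len(group) != 3:
        raise ValueError("this sketch only handles full 3-byte groups (no padding)")
    # Join the three bytes into one 24-bit integer: aaaaaaaa bbbbbbbb cccccccc
    bits = int.from_bytes(group, "big")
    # Cut it into four 6-bit values, most significant first.
    indices = [(bits >> shift) & 0b111111 for shift in (18, 12, 6, 0)]
    return "".join(ALPHABET[i] for i in indices)

print(encode_group(b"XXX"))  # the group 0x58 0x58 0x58 from the worked example
print(encode_group(b"XXX") == base64.b64encode(b"XXX").decode("ascii"))
```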

Basically, the pattern "3 times the byte 0x58" which was obvious in the original data stream is not obvious anymore in the encoded data stream because we've broken the bytes into 6-bit packets and mapped them to new bytes that now appear to be random.

In other words: we've broken the original byte alignment that most compression algorithms rely on.
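
A small Python sketch can make the alignment problem visible. It encodes the same word after 0, 1 and 2 bytes of filler; the word `compress`, the filler byte and the helper name `encode_at_offsets` are arbitrary choices for this illustration.

```python
import base64

def encode_at_offsets(word: bytes, filler: bytes = b"_") -> list[str]:
    """Base64-encode the same word after 0, 1 and 2 bytes of filler."""
    return [base64.b64encode(filler * offset + word).decode("ascii") for offset in range(3)]

variants = encode_at_offsets(b"compress")
for offset, text in enumerate(variants):
    print(offset, text)
```

The same input word ends up looking different in the encoded text depending on where it falls within a 3-byte group, so a byte-oriented compressor has a harder time spotting that it is a repeat.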

Whatever compression method is used, breaking the byte alignment this way usually has a severe impact on how well the algorithm performs. That's why you should always compress first and encode second.
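
A minimal sketch of the compress-first order in Python, using the standard `lzma` and `base64` modules. The helper names and the test payload are assumptions made for this example.

```python
import base64
import lzma

def compress_then_encode(data: bytes) -> bytes:
    """Compress first, then Base64-encode the compressed bytes."""
    return base64.b64encode(lzma.compress(data))

def decode_then_decompress(text: bytes) -> bytes:
    """Reverse compress_then_encode."""
    return lzma.decompress(base64.b64decode(text))

payload = b"XXXXYYYYXXXXYYYY" * 1000
compressed = lzma.compress(payload)
wrapped = compress_then_encode(payload)

# Base64 emits 4 characters per 3 input bytes (rounded up), so its cost is fixed.
print(len(payload), len(compressed), len(wrapped))
```

In this order the compressor sees the original bytes, and Base64 then adds only its fixed 4/3 overhead, which matches the roughly 1.33x growth reported in the question.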

This is even more true for encryption: compress first, encrypt second.

EDIT - A note about LZMA

As MSalters noticed, LZMA (which xz uses) works on bit streams rather than byte streams.

Still, this algorithm will also suffer from Base64 encoding in a way which is essentially consistent with my earlier description:

Input bytes      : 0x58     0x58     0x58
As binary        : 01011000 01011000 01011000
(see above for the details of Base64 encoding)
Output bytes     : 0x57     0x46     0x68     0x59
As binary        : 01010111 01000110 01101000 01011001

Even by working at the bit level, it's much easier to recognize a pattern in the input binary sequence than in the output binary sequence.
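
The two bit sequences can be reproduced with a short Python sketch; `to_bits` is a hypothetical helper used only for display.

```python
import base64

def to_bits(data: bytes) -> str:
    """Render each byte as eight binary digits, separated by spaces."""
    return " ".join(f"{byte:08b}" for byte in data)

source = bytes([0x58, 0x58, 0x58])
encoded = base64.b64encode(source)

print("input :", to_bits(source))
print("output:", to_bits(encoded))
```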
