Is there an algorithm for "perfect" compression?


Problem description


Let me clarify: I'm not talking about perfect compression in the sense of an algorithm that is able to compress any given source material; I realize that is impossible. What I'm trying to get at is an algorithm that is able to encode any source string of bits to its absolute maximum compressed state, as determined by its Shannon entropy.

I believe I have heard some things about Huffman coding being in some sense optimal, so I believe that such an encoding scheme might be based on that, but here is my issue:

Consider the bit-strings: a = "101010101010", b = "110100011010".

Using plain Shannon entropy, these bit strings should have the exact same entropy when we consider the bit strings as simply symbols of 0's and 1's, but this approach is flawed, because we can intuitively see that bitstring a has less entropy than bitstring b because it is simply a pattern of repeated 10's. With this in mind, we could get a better idea of the actual entropy of the source by calculating the Shannon entropy for the composite symbols 00, 10, 01, and 11.
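To make that concrete, here is a small Python sketch (my own illustration, not part of the original post; the helper name is made up) that estimates the empirical Shannon entropy of both strings at the 1-bit and 2-bit symbol levels:

```python
from collections import Counter
from math import log2

def block_entropy(bits: str, k: int) -> float:
    """Empirical Shannon entropy, in bits per k-bit symbol (non-overlapping blocks)."""
    blocks = [bits[i:i + k] for i in range(0, len(bits), k)]
    n = len(blocks)
    return sum((c / n) * log2(n / c) for c in Counter(blocks).values())

a = "101010101010"
b = "110100011010"

print(block_entropy(a, 1), block_entropy(b, 1))  # 1.0 and 1.0: identical at the bit level
print(block_entropy(a, 2), block_entropy(b, 2))  # 0.0 vs ~1.92: a's repeating "10" pattern shows up
```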

This is just my understanding, and I could be totally off base, but from what I understand, for an ergodic source of length n to be truly random, the statistical probabilities of all n-length groups of symbols must be equally likely.

I suppose, to be more specific than the question in the title, I have three main questions:

1. Does Huffman encoding using single bits as symbols compress a bitstring like a optimally, even with the obvious pattern that appears when we analyze the string at the level of 2-bit symbols? (A quick experiment is sketched below.)
2. If not, could one optimally compress a source by cycling through different "levels" (sorry if I'm butchering the terminology here) of Huffman coding until the best compression rate is found?
3. Could going through different "rounds" of Huffman coding further increase the compression rate in some instances? (e.g. first apply Huffman coding with symbols that are 5 bits long, then apply Huffman coding with symbols that are 4 bits long: huff_4bits(huff_5bits(bitstring)))
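For question 1, here is a minimal sketch (my own illustration, not from the original post; huffman_cost_bits and its helper are made-up names) that measures the Huffman-coded size of a at different symbol widths:

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over the symbol stream."""
    freq = Counter(symbols)
    if len(freq) == 1:                        # degenerate alphabet: one code word of 1 bit
        return {symbol: 1 for symbol in freq}
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)       # merge the two least frequent subtrees;
        f2, _, d2 = heapq.heappop(heap)       # every symbol inside them gets one bit deeper
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def huffman_cost_bits(bitstring, k):
    """Total Huffman-coded size in bits (ignoring the code table), using k-bit symbols."""
    symbols = [bitstring[i:i + k] for i in range(0, len(bitstring), k)]
    lengths = huffman_code_lengths(symbols)
    return sum(lengths[s] for s in symbols)

a = "101010101010"
print(huffman_cost_bits(a, 1))   # 12: with 1-bit symbols Huffman cannot compress at all
print(huffman_cost_bits(a, 2))   # 6: the repeated "10" pattern is exploited
```

These sizes ignore the cost of transmitting the code table itself, which dominates for strings this short, but the trend is the point.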

Solution

As stated by Mark, the general answer is "no", due to Kolmogorov complexity. Let me expand a bit on that.

Compression is basically two steps: 1) Model, 2) Entropy.

The role of the model is to "guess" the next bytes or fields to come. The model can have any form, and there is no limit to its effectiveness. A trivial example is a random number generator function: from an external perspective it looks like noise, and therefore cannot be compressed. But if you know the generation function, an infinitely long sequence can be compressed into a small piece of code: the generator function itself.
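A hypothetical sketch of that idea (the generator, its constants, and the sizes below are my own choices, not from the answer): the stream looks like noise to a generic compressor, yet the few lines that produce it, plus the seed, are a complete "model" of it.

```python
import zlib

def lcg_bytes(seed: int, n: int) -> bytes:
    """Noise-looking byte stream: top byte of a 64-bit linear congruential generator."""
    state, out = seed, bytearray()
    for _ in range(n):
        state = (state * 6364136223846793005 + 1442695040888963407) % 2**64
        out.append(state >> 56)
    return bytes(out)

data = lcg_bytes(seed=42, n=100_000)
print(len(zlib.compress(data, 9)))  # close to 100_000: a generic compressor sees only noise
# Yet the "model" -- these few lines plus the seed -- reproduces the stream exactly.
```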

That's why there is "no limit", and Kolmogorov complexity merely states that you can never guarantee there is not a better way to "model" the data.

The second part is computable: entropy is where you find the "Shannon limit". Given a set of symbols (typically the output symbols from the model), which are part of an alphabet, you can compute the optimal cost and find a way to reach the proven ultimate compression limit, which is the Shannon limit.

Huffman is optimal with regard to the Shannon limit if you accept the limitation that each symbol must be encoded using an integer number of bits. This is a close but imperfect approximation. Better compression can be achieved with fractional bits, which is what arithmetic coders offer, as does the more recent ANS-based Finite State Entropy coder. Both get much closer to the Shannon limit.
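To get a feel for that gap, consider a made-up binary source with P(A) = 0.9 and P(B) = 0.1 (numbers chosen purely for illustration):

```python
from math import log2

p = {"A": 0.9, "B": 0.1}   # hypothetical skewed source

shannon = sum(q * log2(1 / q) for q in p.values())  # ~0.469 bits/symbol: the Shannon limit
huffman = 1.0                                       # each of the two symbols must get >= 1 whole bit
print(f"Shannon limit: {shannon:.3f} bits/symbol vs Huffman: {huffman:.3f} bits/symbol")
```

An arithmetic or ANS coder can spend roughly 0.47 bits per symbol on such a source, whereas Huffman is stuck at 1 bit per symbol unless symbols are first grouped into larger blocks.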

The Shannon limit only applies if you treat a set of symbols "individually". As soon as you try to "combine them", or find any correlations between the symbols, you are "modeling". And this is the territory of Kolmogorov Complexity, which is not computable.
