What is the current state of text-only compression algorithms?


Question

In honor of the Hutter Prize, what are the top algorithms (and a quick description of each) for text compression?

Note: The intent of this question is to get a description of compression algorithms, not of compression programs.

Answer

The boundary-pushing compressors combine algorithms for insane results. Common algorithms include:

  • The Burrows-Wheeler Transform - shuffles characters (or other bit blocks) with a predictable algorithm to increase repeated blocks, which makes the source easier to compress. Decompression occurs as normal and the result is un-shuffled with the reverse transform. Note: BWT alone doesn't actually compress anything; it just makes the source easier to compress (see the sketch after this list).
  • Prediction by Partial Matching (PPM) - an evolution of arithmetic coding where the prediction model (context) is created by crunching statistics about the source versus using static probabilities. Even though its roots are in arithmetic coding, the result can be represented with Huffman encoding or a dictionary as well as arithmetic coding.
  • Context Mixing - arithmetic coding uses a static context for prediction, PPM dynamically chooses a single context, and context mixing uses many contexts and weighs their results. PAQ uses context mixing (a toy mixer sketch follows after this list).
  • Dynamic Markov Compression - related to PPM but uses bit-level contexts versus byte or longer.
  • In addition, the Hutter Prize contestants may replace common text with small-byte entries from external dictionaries and differentiate upper- and lower-case text with a special symbol versus using two distinct entries. That's why they're so good at compressing text (especially ASCII text) and not as valuable for general compression.
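
To make the Burrows-Wheeler bullet concrete, here is a minimal, unoptimized sketch in Python. This is my own illustration rather than code from any particular compressor: the names bwt_encode/bwt_decode are invented for the example, and a NUL sentinel is assumed not to occur in the input. Real implementations derive the transform from a suffix array instead of sorting full rotations.

```python
def bwt_encode(text: str, sentinel: str = "\x00") -> str:
    """Return the Burrows-Wheeler transform of text (naive rotation-sort version)."""
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)  # last column of the sorted rotations


def bwt_decode(last_column: str, sentinel: str = "\x00") -> str:
    """Invert the transform by repeatedly prepending the last column and re-sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(last_column[i] + table[i] for i in range(len(last_column)))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]  # drop the sentinel


if __name__ == "__main__":
    src = "banana bandana banana"
    out = bwt_encode(src)
    print(repr(out))                 # similar characters tend to cluster into runs
    assert bwt_decode(out) == src    # the transform round-trips exactly
```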
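
The context-mixing bullet can likewise be illustrated with a toy logistic mixer, loosely in the spirit of PAQ's mixer but not its actual code; the class ToyMixer, its method names, and the learning-rate value are assumptions made for this sketch. Each model reports a probability that the next bit is 1; the mixer combines them in the stretched (logistic) domain and shifts weight toward models that predicted well.

```python
import math


def stretch(p: float) -> float:
    """Map a probability in (0, 1) into the logistic domain."""
    return math.log(p / (1.0 - p))


def squash(x: float) -> float:
    """Inverse of stretch: map a logistic-domain value back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


class ToyMixer:
    """Combine several bit predictors with online-learned weights."""

    def __init__(self, n_models: int, learning_rate: float = 0.02) -> None:
        self.weights = [0.0] * n_models
        self.learning_rate = learning_rate
        self.stretched: list[float] = []

    def mix(self, probs: list[float]) -> float:
        """Return the mixed probability that the next bit is 1."""
        self.stretched = [stretch(p) for p in probs]
        return squash(sum(w * s for w, s in zip(self.weights, self.stretched)))

    def update(self, mixed: float, bit: int) -> None:
        """Move weights toward models that predicted the observed bit well."""
        error = bit - mixed
        self.weights = [w + self.learning_rate * error * s
                        for w, s in zip(self.weights, self.stretched)]


if __name__ == "__main__":
    mixer = ToyMixer(n_models=2)
    bits = [1, 1, 0, 1, 1, 1, 0, 1] * 50   # a biased toy bit stream
    total_bits = 0.0
    for bit in bits:
        # Model A is uninformative (0.5); model B has (artificially) learned the bias.
        p = mixer.mix([0.5, 0.75])
        total_bits += -math.log2(p if bit == 1 else 1.0 - p)  # ideal coded length
        mixer.update(p, bit)
    print(f"ideal size: {total_bits:.1f} bits for {len(bits)} input bits")
```

The ideal coded length printed at the end is what an arithmetic coder driven by the mixed predictions would approach, which is how these predictions turn into actual compression.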

Maximum Compression is a pretty cool text and general compression benchmark site. Matt Mahoney publishes another benchmark. Mahoney's may be of particular interest because it lists the primary algorithm used per entry.
