用于编码单词列表的压缩算法 [英] Compression Algorithm for Encoding Word Lists

查看:176
本文介绍了用于编码单词列表的压缩算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在寻找具体的建议或参考一个算法和/或数据结构来编码一个单词列表,将有效地成为一个拼写检查字典。该方案的目标将导致原始单词列表的非常高的压缩比为编码形式。对编码字典的唯一输出要求是可以以相对有效的方式针对原始单词列表测试任何提出的目标字的存在。例如,应用程序可能想要对100,000个词典检查10,000个单词。编码字典表单的 需要能够[轻松]转换回原始单词表单 - 二进制是/否结果是所有对于针对所得字典测试的每个单词,都需要。



我假设编码方案,以提高压缩率,将利用给定语言中的已知结构,例如单数和复数形式,所有格式,收缩等。我特别感兴趣的是编码主要是英文单词,但要明确的是,该方案必须能够编码任何和所有ASCII文本单词。



我想到的具体应用是可以假设的是嵌入式设备,其中非易失性存储空间是非常重要的,字典将是随机访问的只读存储区。



编辑 :总结字典的要求:




  • 零假阳性

  • 零假阴性

  • 非常高的压缩比

  • 不需要解压缩


解决方案

请参阅McIlroy的开发拼写列表他的酒吧页面 。经典旧纸上的小型计算机上拼写检查,这对您列出的地图令人惊讶地将其映射到令人惊讶的地图上。贴片剥离和两种不同压缩方法的详细分析:Bloom过滤器和相关方案Huffman编码稀疏位元;我会去布鲁姆滤波器,可能不喜欢他选择的方法,这会挤出更多的kB,以极高的速度。 (编程珍珠有关于本文的简短章节。)



另请参阅用于在全文搜索系统中存储词典的方法,例如信息检索简介。与上述方法不同,这没有误报。


I'm am looking for specific suggestions or references to an algorithm and/or data structures for encoding a list of words into what would effectively would turn out to be a spell checking dictionary. The objectives of this scheme would result in a very high compression ratio of the raw word list into the encoded form. The only output requirement I have on the encoded dictionary is that any proposed target word can be tested for existence against the original word list in a relatively efficient manner. For example, the application might want to check 10,000 words against a 100,000 word dictionary. It is not a requirement for the encoded dictionary form to be able to be [easily] converted back into the original word list form - a binary yes/no result is all that is needed for each word tested against the resulting dictionary.

I am assuming the encoding scheme, to improve compression ratio, would take advantage of known structures in a given language such as singular and plural forms, possessive forms, contractions, etc. I am specifically interested in encoding mainly English words, but to be clear, the scheme must be able to encode any and all ASCII text "words".

The particular application I have in mind you can assume is for embedded devices where non-volatile storage space is at a premium and the dictionary would be a randomly accessible read-only memory area.

EDIT: To sum up the requirements of the dictionary:

  • zero false positives
  • zero false negatives
  • very high compression ratio
  • no need for decompression

解决方案

See McIlroy's "Development of a Spelling List" at his pubs page. Classic old paper on spellchecking on a minicomputer, which constraints map surprisingly well onto the ones you listed. Detailed analysis of affix stripping and two different compression methods: Bloom filters and a related scheme Huffman-coding a sparse bitset; I'd go with Bloom filters probably in preference to the method he picked, which squeezes a few more kB out at significant cost in speed. (Programming Pearls has a short chapter about this paper.)

See also the methods used to store the lexicon in full-text search systems, e.g. Introduction to Information Retrieval. Unlike the above methods this has no false positives.

这篇关于用于编码单词列表的压缩算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆