Compression Algorithm for Encoding Word Lists

Question

I am looking for specific suggestions or references to an algorithm and/or data structures for encoding a list of words into what would effectively be a spell-checking dictionary. The objective of the scheme is a very high compression ratio of the raw word list into the encoded form. The only output requirement I have on the encoded dictionary is that any proposed target word can be tested for existence against the original word list in a relatively efficient manner. For example, the application might want to check 10,000 words against a 100,000-word dictionary. It is not a requirement for the encoded dictionary form to be [easily] convertible back into the original word list form: a binary yes/no result is all that is needed for each word tested against the resulting dictionary.

I am assuming the encoding scheme, to improve compression ratio, would take advantage of known structures in a given language such as singular and plural forms, possessive forms, contractions, etc. I am specifically interested in encoding mainly English words, but to be clear, the scheme must be able to encode any and all ASCII text "words".

The particular application I have in mind is for embedded devices, where non-volatile storage space is at a premium and the dictionary would be a randomly accessible read-only memory area.

EDIT: To sum up the requirements of the dictionary:

  • Zero false positives
  • Zero false negatives
  • Very high compression ratio
  • No need for decompression

Recommended answer

See McIlroy's "Development of a Spelling List" at his pubs page. It is a classic old paper on spellchecking on a minicomputer, whose constraints map surprisingly well onto the ones you listed. It gives a detailed analysis of affix stripping and two different compression methods: Bloom filters and a related scheme that Huffman-codes a sparse bitset. I'd probably go with Bloom filters in preference to the method he picked, which squeezes a few more kB out at significant cost in speed. (Programming Pearls has a short chapter about this paper.)
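
A minimal Bloom filter sketch in Python may help make the idea concrete. It is not McIlroy's implementation: the class name, the use of SHA-256 with double hashing, and the sizes in the usage note are all assumptions chosen for illustration. Note that a Bloom filter never gives a false negative but can give false positives, so on its own it does not meet a strict zero-false-positive requirement.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: a fixed bit array plus k hash positions per word."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One bit per slot, packed eight to a byte.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, word):
        # Assumption: derive k positions from one SHA-256 digest via
        # double hashing (h1 + i * h2). Any good hash family would do.
        digest = hashlib.sha256(word.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        # True means "probably present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(word)
        )
```

In use, every word from the list is passed to `add` once when the dictionary is built, and only the `bits` array needs to be stored in read-only memory; a lookup is then `word in bloom`. The bit count and number of hashes trade table size against false-positive rate.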

See also the methods used to store the lexicon in full-text search systems, e.g. Introduction to Information Retrieval. Unlike the above methods this has no false positives.
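
One lexicon-compression technique described in that book is front coding of a sorted term list, where each word is stored as the length of the prefix it shares with the previous word plus the remaining suffix. The sketch below is only an illustration under assumptions: the function names are hypothetical, the encoded form is a Python list of tuples rather than a packed byte layout, and lookup is a plain linear scan. Membership tests are exact, but words are rebuilt on the fly during the scan.

```python
def front_encode(words):
    """Encode words as (shared_prefix_length, suffix) pairs in sorted order."""
    encoded = []
    prev = ""
    for word in sorted(set(words)):
        # Count how many leading characters match the previous word.
        n = 0
        while n < min(len(prev), len(word)) and prev[n] == word[n]:
            n += 1
        encoded.append((n, word[n:]))
        prev = word
    return encoded


def contains(encoded, target):
    """Exact membership test by scanning and rebuilding each word in turn."""
    prev = ""
    for n, suffix in encoded:
        word = prev[:n] + suffix
        if word == target:
            return True
        if word > target:
            # The list is sorted, so the target cannot appear later.
            return False
        prev = word
    return False
```

For example, `front_encode(["automata", "automate", "automatic", "automation"])` yields `[(0, "automata"), (7, "e"), (7, "ic"), (8, "on")]`. A practical variant would store the first word of each fixed-size block in full so that a binary search over blocks can replace the linear scan.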
