Compression Algorithm for Encoding Word Lists
Problem description
I am looking for specific suggestions or references to an algorithm and/or data structures for encoding a list of words into what would effectively turn out to be a spell-checking dictionary. The objective of this scheme is a very high compression ratio of the raw word list into the encoded form. The only output requirement I have on the encoded dictionary is that any proposed target word can be tested for existence against the original word list in a relatively efficient manner. For example, the application might want to check 10,000 words against a 100,000-word dictionary. It is not a requirement for the encoded dictionary to be [easily] convertible back into the original word list form - a binary yes/no result is all that is needed for each word tested against the resulting dictionary.
I am assuming the encoding scheme, to improve compression ratio, would take advantage of known structures in a given language such as singular and plural forms, possessive forms, contractions, etc. I am specifically interested in encoding mainly English words, but to be clear, the scheme must be able to encode any and all ASCII text "words".
The particular application I have in mind you can assume is for embedded devices where non-volatile storage space is at a premium and the dictionary would be a randomly accessible read-only memory area.
EDIT: To sum up the requirements of the dictionary:
- zero false positives
- zero false negatives
- very high compression ratio
- no need for decompression
See McIlroy's "Development of a Spelling List" at his pubs page. Classic old paper on spellchecking on a minicomputer, whose constraints map surprisingly well onto the ones you listed. Detailed analysis of affix stripping and two different compression methods: Bloom filters, and a related scheme that Huffman-codes a sparse bitset. I'd go with Bloom filters, probably, in preference to the method he picked, which squeezes a few more kB out at significant cost in speed. (Programming Pearls has a short chapter about this paper.)
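To make the Bloom-filter idea concrete, here is a minimal sketch (not McIlroy's actual construction; the bit-array size, the use of a single SHA-256 digest to derive the hash positions, and the class shape are all assumptions for illustration). Note that a Bloom filter relaxes the zero-false-positive requirement in your list: lookups can return a false "yes" at a small, tunable rate, in exchange for very compact storage and no decompression.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch for word-membership tests.

    Assumption: k bit positions are derived by slicing one SHA-256
    digest into 4-byte chunks, so num_hashes must be <= 8.
    """

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # packed bit array

    def _positions(self, word):
        # Derive k pseudo-independent bit positions from one digest.
        digest = hashlib.sha256(word.encode("ascii")).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        # True if every derived bit is set; "no" is always correct,
        # "yes" is correct except for a small false-positive rate.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(word))
```

At build time you `add` every dictionary word once; on the device, only the packed bit array needs to live in read-only memory, and each lookup touches `num_hashes` bits.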
See also the methods used to store the lexicon in full-text search systems, e.g. Introduction to Information Retrieval. Unlike the above methods this has no false positives.
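One standard lexicon-storage technique from such IR systems is blocked front coding over the sorted word list: each block stores its first word in full, and each later word as a (shared-prefix length, suffix) pair; binary search over the block heads finds the one block to decode per lookup. A minimal sketch follows (the block size and in-memory representation are my assumptions, not taken from the book); unlike a Bloom filter, this has zero false positives.

```python
import bisect

def build_front_coded(words, block_size=16):
    """Front-code a word list into blocks of prefix-compressed entries.

    Returns (heads, blocks): heads[i] is block i's first word stored
    whole; blocks[i] is a list of (shared_prefix_len, suffix) pairs
    for the remaining words in sorted order.
    """
    words = sorted(words)
    heads, blocks = [], []
    for start in range(0, len(words), block_size):
        chunk = words[start:start + block_size]
        heads.append(chunk[0])
        encoded, prev = [], chunk[0]
        for w in chunk[1:]:
            shared = 0
            while shared < min(len(prev), len(w)) and prev[shared] == w[shared]:
                shared += 1
            encoded.append((shared, w[shared:]))
            prev = w
        blocks.append(encoded)
    return heads, blocks

def lookup(heads, blocks, word):
    """Exact membership test: binary-search the block heads, then
    decode at most one block. No false positives or negatives."""
    i = bisect.bisect_right(heads, word) - 1
    if i < 0:
        return False
    if heads[i] == word:
        return True
    prev = heads[i]
    for shared, suffix in blocks[i]:
        prev = prev[:shared] + suffix  # reconstruct next word
        if prev == word:
            return True
        if prev > word:
            return False  # passed the sorted position: definitely absent
    return False
```

On an embedded device the heads and blocks would be flattened into a byte array with a small offset table, but the access pattern is the same: one binary search plus a short linear decode, with no need to decompress the whole dictionary.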