Fast search in compressed text files

Problem description

I need to be able to search for text in a large number of files (.txt) that are zipped. The compression may be changed to something else or even become proprietary. I want to avoid unpacking all the files; instead I would like to compress (encode) the search string and search within the compressed files. This should be possible using Huffman compression with the same codebook for all files. I don't want to reinvent the wheel, so... does anyone know of a library that does something like this, or a tested implementation of the Huffman algorithm, or maybe have a better idea?

Thanks in advance

Recommended answer

Most text files are compressed with one of the LZ family of algorithms, which combine a Dictionary Coder with an Entropy Coder such as Huffman (see http://en.wikipedia.org/wiki/Minimum_redundancy_coding).

Because the Dictionary Coder relies on a continuously updated dictionary, its coding result depends on the history (all of the codes in the dictionary, derived from the input data up to the current symbol), so it is not possible to jump to an arbitrary location and start decoding without first decoding all of the previous data.

In my opinion, you can just use a zlib stream decoder which returns decompressed data as it goes without waiting for the entire file to be decompressed. This will not save execution time but will save memory.
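
As a concrete illustration of that streaming approach, here is a minimal Python sketch (my own example, not code from the answer) that scans a gzip- or zlib-compressed file chunk by chunk with zlib.decompressobj; the file name, chunk size and UTF-8 assumption are placeholders.

import zlib

def stream_search(path, needle, chunk_size=64 * 1024):
    # Scan a gzip- or zlib-compressed file for `needle` without ever
    # holding the whole decompressed text in memory.
    needle = needle.encode("utf-8")
    # wbits = 32 + MAX_WBITS lets zlib auto-detect a gzip or zlib header.
    decomp = zlib.decompressobj(wbits=zlib.MAX_WBITS | 32)
    tail = b""  # carry-over so matches spanning two chunks are not missed
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            text = tail + decomp.decompress(block)
            if needle in text:
                return True
            tail = text[-(len(needle) - 1):] if len(needle) > 1 else b""
    # Check whatever the decoder still holds at end of stream.
    return needle in (tail + decomp.flush())

# Hypothetical usage:
# print(stream_search("docs/article-001.txt.gz", "huffman"))

The tail carried between chunks is what keeps a match that straddles a chunk boundary from being missed.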

A second suggestion is to do Huffman coding on English words, and forget about the Dictionary Coder part. Each English word gets mapped to a unique prefix-free code.
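
As a rough sketch of that word-level Huffman idea (again my own illustration), the following Python keeps the codes as an in-memory bitstring for clarity; a real implementation would pack the bits into bytes and persist the shared codebook alongside the files.

import heapq
from collections import Counter

def build_codebook(word_counts):
    # Classic Huffman construction over word frequencies; every file is
    # encoded with this one shared codebook.
    heap = [[freq, [word, ""]] for word, freq in word_counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = "0" + pair[1]
        for pair in hi[1:]:
            pair[1] = "1" + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return dict((word, code) for word, code in heap[0][1:])

def encode_words(words, codebook):
    return "".join(codebook[w] for w in words)

def contains_word(bits, codebook, query):
    # Prefix-free codes decode unambiguously from the start of the stream,
    # so we can compare codes at token boundaries without rebuilding text.
    query_code = codebook[query]
    known = set(codebook.values())
    buf = ""
    for bit in bits:
        buf += bit
        if buf in known:
            if buf == query_code:
                return True
            buf = ""
    return False

# Toy usage:
corpus = "the quick brown fox jumps over the lazy dog the end".split()
book = build_codebook(Counter(corpus))
compressed = encode_words(corpus, book)
print(contains_word(compressed, book, "lazy"))  # True
# A word missing from the shared codebook cannot occur in any file at all.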

Finally, @SHODAN gave the most sensible suggestion, which is to index the files, compress the index, and bundle it with the compressed text files. To do a search, decompress just the index file and look the words up. This is in fact an improvement over doing the Huffman coding on words: once you have found the word frequencies (needed to assign the prefix codes optimally), you have effectively already built the index, so you can keep it around for searching.
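
For completeness, here is a small hypothetical sketch of the index route in Python, assuming the plain text is available when the index is built; the index file name index.json.gz and the whitespace tokenization are arbitrary placeholder choices.

import gzip
import json
from collections import defaultdict

def build_index(paths, index_path="index.json.gz"):
    # Inverted index: word -> files containing it, stored gzip-compressed
    # alongside the compressed documents.
    index = defaultdict(set)
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for word in f.read().lower().split():
                index[word].add(path)
    with gzip.open(index_path, "wt", encoding="utf-8") as out:
        json.dump({w: sorted(files) for w, files in index.items()}, out)

def lookup(word, index_path="index.json.gz"):
    # Only the (small) index is decompressed; the documents stay packed.
    with gzip.open(index_path, "rt", encoding="utf-8") as f:
        index = json.load(f)
    return index.get(word.lower(), [])

# Hypothetical usage:
# build_index(["doc1.txt", "doc2.txt"])
# print(lookup("huffman"))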
