Data structure for fast full text search


Problem description

A trie seems like it would work for small strings, but not for large documents (1–100s of pages of text), so I'm not sure. Maybe it is possible to combine an inverted index with a suffix tree to get the best of both worlds. Or perhaps use a B-tree with words stored as nodes, and a trie for each node. Not sure. Wondering what a good data structure would be (B-tree, linked list, etc.).

I'm thinking of searching documents such as regular books, web pages, and source code, so the idea of storing just words in an inverted index doesn't seem quite right. It would be helpful to know whether each case needs its own solution, whether there is a general one that works for them all, or whether a combination is required.

Recommended answer

You do need an inverted index at the end of the day for interleaving matching results from each of your query terms, but an inverted index can be built from either a trie or a hash map. A trie would allow fuzzy look-ups, while a hash-map-based inverted index would only allow exact look-up of a token.
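For illustration, here is a minimal sketch of the hash-map flavor (my own Python example, not code from this answer or from Typesense): each token maps to a postings set of document ids, and a multi-term query is answered by intersecting the postings of its terms. It supports exact token look-up only.

```python
# Minimal sketch of a hash-map-based inverted index (illustrative layout,
# not any particular engine's implementation): token -> set of doc ids.
from collections import defaultdict
import re

class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of doc ids

    def add_document(self, doc_id, text):
        # Simple word tokenizer; real engines also normalize, stem,
        # and record positions for phrase queries.
        for token in re.findall(r"\w+", text.lower()):
            self.postings[token].add(doc_id)

    def search(self, query):
        # Exact token look-up only; interleave (intersect) the postings
        # of every query term to get documents matching all terms.
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return set()
        result = self.postings.get(terms[0], set()).copy()
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result

# Usage
idx = InvertedIndex()
idx.add_document(1, "Data structure for fast full text search")
idx.add_document(2, "A trie works for small strings")
print(idx.search("full text"))   # {1}
```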

To optimize for memory usage, you can use memory-optimized versions of a trie, such as a radix tree or an adaptive radix tree (ART). I've had great success using ART for an open-source fuzzy search engine project I've been working on: https://github.com/typesense/typesense
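As a rough sketch of the trie side, here is a plain character trie used as the term dictionary in front of the postings sets (again my own illustration; a radix tree or ART additionally compresses chains of single-child nodes to save memory). The prefix walk below is the kind of non-exact look-up a hash map cannot do without scanning every key.

```python
# Plain character trie as a term dictionary; each terminal node carries
# the postings (doc ids) for the token that ends there.
class TrieNode:
    __slots__ = ("children", "doc_ids")
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.doc_ids = set()  # postings for the token ending here

class TrieIndex:
    def __init__(self):
        self.root = TrieNode()

    def add(self, token, doc_id):
        node = self.root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.doc_ids.add(doc_id)

    def prefix_search(self, prefix):
        # Walk down to the prefix node, then collect postings from the
        # whole subtree below it.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return set()
            node = node.children[ch]
        result = set()
        stack = [node]
        while stack:
            n = stack.pop()
            result |= n.doc_ids
            stack.extend(n.children.values())
        return result

# Usage
t = TrieIndex()
t.add("search", 1)
t.add("searching", 2)
t.add("trie", 3)
print(t.prefix_search("sear"))   # {1, 2}
```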

With Typesense, I was able to index about 1 million Hacker News titles in about 165 MB of RAM (the uncompressed size on disk was 85 MB). You can probably squeeze that down even further if your use case is more specific and you don't need some of the metadata fields I added to the data structure.

