lucene是如何索引文档的? [英] How does lucene index documents?

查看:18
本文介绍了lucene是如何索引文档的?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我阅读了一些关于 Lucene 的文档;我也阅读了此链接中的文档(http://lucene.sourceforge.net/talks/pisa).

I read some document about Lucene; also I read the document in this link (http://lucene.sourceforge.net/talks/pisa).

我不太了解 Lucene 如何索引文档,也不了解 Lucene 用于索引的算法?

I don't really understand how Lucene indexes documents and don't understand which algorithms Lucene uses for indexing?

在上面的链接上,它说 Lucene 使用这种算法进行索引:

On the above link, it says Lucene uses this algorithm for indexing:

  • 增量算法:
    • 维护一堆段索引
    • 为每个传入的文档创建索引
    • 将新索引推入堆栈
    • 让 b=10 为合并因子;M=8

    <小时>

    for (size = 1; size < M; size *= b) {
        if (there are b indexes with size docs on top of the stack) {
            pop them off the stack;
            merge them into a single index;
            push the merged index onto the stack;
        } else {
            break;
        }
    }
    

    该算法如何提供优化的索引?

    How does this algorithm provide optimized indexing?

    Lucene 是否使用 B 树算法或任何其他类似的算法进行索引- 或者它有特定的算法吗?

    Does Lucene use B-tree algorithm or any other algorithm like that for indexing - or does it have a particular algorithm?

    推荐答案

    简而言之,Lucene 使用 Skip-Lists 在磁盘上,然后加载索引词的映射使用 >有限状态转换器 (FST).但是请注意,Lucene 不会(必须)加载所有将术语索引到 RAM,正如 Lucene 索引系统的作者 Michael McCandless 所描述的那样.请注意,通过使用跳过列表,索引可以从一个命中遍历到另一个,从而使set 之类的事情成为可能,尤其是范围查询(很像 B 树).而维基百科关于索引跳过列表的条目也解释了为什么调用 Lucene 的跳过列表实现多级 Skip-List - 本质上,使 O(log n) 查找成为可能(同样,很像 B 树).

    In a nutshell, Lucene builds an inverted index using Skip-Lists on disk, and then loads a mapping for the indexed terms into memory using a Finite State Transducer (FST). Note, however, that Lucene does not (necessarily) load all indexed terms to RAM, as described by Michael McCandless, the author of Lucene's indexing system himself. Note that by using Skip-Lists, the index can be traversed from one hit to another, making things like set and, particularly, range queries possible (much like B-Trees). And the Wikipedia entry on indexing Skip-Lists also explains why Lucene's Skip-List implementation is called a multi-level Skip-List - essentially, to make O(log n) look-ups possible (again, much like B-Trees).

    因此,一旦倒排(术语)索引 - 基于 Skip-List 数据结构 - 从文件中构建,索引存储在磁盘上.然后 Lucene 将(正如已经说过的:可能只是其中的一部分)这些术语加载到 有限状态转换器,在 FST 实现中 松散启发 来自 Morfologick.

    So once the inverted (term) index - which is based on a Skip-List data structure - is built from the documents, the index is stored on disk. Lucene then loads (as already said: possibly, only some of) those terms into a Finite State Transducer, in an FST implementation loosely inspired by Morfologick.

    Michael McCandless(也)在 解释 Lucene 如何以及为什么使用(最小非循环)FST 来索引 Lucene 存储在内存中的术语,本质上是一个 SortedMap,并给出了一个基本思想了解 FST 如何工作(即 FST 如何压缩字节序列 [即索引项] 以使该映射的内存使用增长为次线性).他指出描述特定 FST 算法的论文 Lucene 也使用.

    Michael McCandless (also) does a pretty good and terse job of explaining how and why Lucene uses a (minimal acyclic) FST to index the terms Lucene stores in memory, essentially as a SortedMap<ByteSequence,SomeOutput>, and gives a basic idea for how FSTs work (i.e., how the FST compacts the byte sequences [i.e., the indexed terms] to make the memory use of this mapping grow sub-linear). And he points to the paper that describes the particular FST algorithm Lucene uses, too.

    对于那些好奇为什么 Lucene 使用 Skip-Lists 而大多数数据库使用 (B+)- 和/或 (B)-Trees 的人,请查看 关于这个问题的正确 SO 答案(跳过列表与 B 树).这个答案给出了一个很好的、深刻的解释——本质上,不是使索引的并发更新更适合"(因为你可以决定不立即重新平衡 B 树,从而获得与跳过列表的并发性能大致相同),但是跳过列表使您不必从事(延迟或不延迟)平衡操作(最终)B-Trees 需要(事实上,正如答案所示/参考资料,B-Trees 和 [multi-level] Skip-Lists 之间的性能差异可能很小,如果两者都做得对".)

    For those curious why Lucene uses Skip-Lists, while most databases use (B+)- and/or (B)-Trees, take a look at the right SO answer regarding this question (Skip-Lists vs. B-Trees). That answer gives a pretty good, deep explanation - essentially, not so much make concurrent updates of the index "more amenable" (because you can decide to not re-balance a B-Tree immediately, thereby gaining about the same concurrent performance as a Skip-List), but rather, Skip-Lists save you from having to work on the (delayed or not) balancing operation (ultimately) needed by B-Trees (In fact, as the answer shows/references, there is probably very little performance difference between B-Trees and [multi-level] Skip-Lists, if either are "done right.")

    这篇关于lucene是如何索引文档的?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆