lucene如何索引文件? [英] How does lucene index documents?

查看:22
本文介绍了lucene如何索引文件?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我阅读了一些关于 Lucene 的文档;我也阅读了此链接中的文档(http://lucene.sourceforge.net/talks/pisa).

I read some document about Lucene; also I read the document in this link (http://lucene.sourceforge.net/talks/pisa).

我不太了解 Lucene 如何索引文档,也不了解 Lucene 使用哪些算法进行索引?

I don't really understand how Lucene indexes documents and don't understand which algorithms Lucene uses for indexing?

在上面的链接中,它说 Lucene 使用这种算法进行索引:

On the above link, it says Lucene uses this algorithm for indexing:

  • 增量算法:
    • 维护一组段索引
    • 为每个传入文档创建索引
    • 将新索引推入堆栈
    • 设 b=10 为合并因子;M=8

    <小时>

    for (size = 1; size < M; size *= b) {
        if (there are b indexes with size docs on top of the stack) {
            pop them off the stack;
            merge them into a single index;
            push the merged index onto the stack;
        } else {
            break;
        }
    }
    

    此算法如何提供优化的索引?

    How does this algorithm provide optimized indexing?

    Lucene 是否使用 B-tree 算法或任何其他类似的算法进行索引- 或者它有特定的算法?

    Does Lucene use B-tree algorithm or any other algorithm like that for indexing - or does it have a particular algorithm?

    推荐答案

    简而言之,Lucene 使用 Skip-Lists 磁盘上,然后加载索引词的映射进入内存 使用 有限状态传感器 (FST).但是请注意,Lucene 不会(必然)加载所有正如 Lucene 索引系统的作者 Michael McCandless 所描述的那样,将术语索引到 RAM.请注意,通过使用 Skip-Lists,索引可以从一个命中遍历到另一个命中,使得 set 之类的事情,尤其是 range 查询 成为可能(很像 B-Trees).Wikipedia entry on indexing Skip-Lists 也解释了为什么调用 Lucene 的 Skip-List 实现多级跳过列表 - 本质上,使 O(log n) 查找成为可能(再次,很像 B-Trees).

    In a nutshell, Lucene builds an inverted index using Skip-Lists on disk, and then loads a mapping for the indexed terms into memory using a Finite State Transducer (FST). Note, however, that Lucene does not (necessarily) load all indexed terms to RAM, as described by Michael McCandless, the author of Lucene's indexing system himself. Note that by using Skip-Lists, the index can be traversed from one hit to another, making things like set and, particularly, range queries possible (much like B-Trees). And the Wikipedia entry on indexing Skip-Lists also explains why Lucene's Skip-List implementation is called a multi-level Skip-List - essentially, to make O(log n) look-ups possible (again, much like B-Trees).

    因此,一旦倒排(术语)索引 - 基于 Skip-List 数据结构 - 从文档构建,索引存储在磁盘上.然后,Lucene 将这些术语加载(如前所述:可能,只是其中的一部分)到 有限状态转换器,在 FST 实现中 松散的启发 by Morfologick.

    So once the inverted (term) index - which is based on a Skip-List data structure - is built from the documents, the index is stored on disk. Lucene then loads (as already said: possibly, only some of) those terms into a Finite State Transducer, in an FST implementation loosely inspired by Morfologick.

    Michael McCandless(也)在 解释了 Lucene 如何以及为什么使用(最小无环)FST 来索引 Lucene 存储在内存中的术语,本质上是作为 SortedMap,并给出了一个基本概念了解 FST 如何工作(即 FST 如何压缩字节序列 [即索引项] 以使该映射的内存使用增长为次线性).他指出 描述特定 FST 算法的论文 Lucene 也使用.

    Michael McCandless (also) does a pretty good and terse job of explaining how and why Lucene uses a (minimal acyclic) FST to index the terms Lucene stores in memory, essentially as a SortedMap<ByteSequence,SomeOutput>, and gives a basic idea for how FSTs work (i.e., how the FST compacts the byte sequences [i.e., the indexed terms] to make the memory use of this mapping grow sub-linear). And he points to the paper that describes the particular FST algorithm Lucene uses, too.

    对于那些好奇为什么 Lucene 使用 Skip-Lists 而大多数数据库使用 (B+)- 和/或 (B)-Trees 的人,请查看 关于这个问题的正确答案(Skip-Lists vs. B-Trees).这个答案给出了一个非常好的、深入的解释——本质上,没有让索引的并发更新更适合"(因为您可以决定不立即重新平衡 B-Tree,从而获得与跳过列表大致相同的并发性能),但是,跳过列表使您不必处理(延迟或不延迟)平衡操作(最终)B-Trees 需要(事实上,正如答案显示/引用的那样,B-Trees 和 [多级] 跳过列表之间的性能差异可能很小,如果其中任何一个正确完成".)

    For those curious why Lucene uses Skip-Lists, while most databases use (B+)- and/or (B)-Trees, take a look at the right SO answer regarding this question (Skip-Lists vs. B-Trees). That answer gives a pretty good, deep explanation - essentially, not so much make concurrent updates of the index "more amenable" (because you can decide to not re-balance a B-Tree immediately, thereby gaining about the same concurrent performance as a Skip-List), but rather, Skip-Lists save you from having to work on the (delayed or not) balancing operation (ultimately) needed by B-Trees (In fact, as the answer shows/references, there is probably very little performance difference between B-Trees and [multi-level] Skip-Lists, if either are "done right.")

    这篇关于lucene如何索引文件?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆