How does Lucene index documents?

Question

I have read some documentation about Lucene, including the document at this link (http://lucene.sourceforge.net/talks/pisa).

I don't really understand how Lucene indexes documents, or which algorithms it uses for indexing.

According to the link above, Lucene uses this algorithm for indexing:

  • incremental algorithm:
    • maintain a stack of segment indices
    • create index for each incoming document
    • push new indexes onto the stack
    • let b=10 be the merge factor; M=8


for (size = 1; size < M; size *= b) {
    if (there are b indexes with size docs on top of the stack) {
        pop them off the stack;
        merge them into a single index;
        push the merged index onto the stack;
    } else {
        break;
    }
}
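
As a rough illustration of the loop above, here is a minimal Python sketch that tracks only segment sizes (document counts), not real index data. The names `add_document`, `index_documents` and `max_size` are hypothetical, and `max_size` plays the role of M. It defaults to infinity here as an assumption, because with b=10 a bound of 8 would allow only the first merge level.

```python
def add_document(stack, b=10, max_size=float("inf")):
    """Push a one-document segment, then apply the quoted merge loop.

    `stack` is a list of segment sizes; the last element is the top.
    """
    stack.append(1)  # each incoming document starts as its own segment
    size = 1
    while size < max_size:
        top = stack[-b:]
        if len(top) == b and all(s == size for s in top):
            del stack[-b:]          # pop the b equal-sized segments
            stack.append(size * b)  # push the merged segment
        else:
            break
        size *= b
    return stack


def index_documents(n, b=10, max_size=float("inf")):
    """Index n documents one at a time and return the segment sizes."""
    stack = []
    for _ in range(n):
        add_document(stack, b, max_size)
    return stack


segments = index_documents(123)
```

In this sketch, 123 documents end up as one segment of 100, two of 10 and three of 1. With an unbounded `max_size`, the stack never holds more than b-1 segments of any one size after a document is added, so the number of segments grows roughly with the logarithm of the number of documents.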

How does this algorithm provide optimized indexing?

Does Lucene use a B-tree or a similar structure for indexing, or does it have its own algorithm?

Answer

There's a fairly good article here: https://web.archive.org/web/20130904073403/http://www.ibm.com/developerworks/library/wa-lucene/

Edit 12/2014: Updated to an archived version because the original was deleted. Probably the best more recent alternative is http://lucene.apache.org/core/3_6_2/fileformats.html

There's an even more recent version at http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/codecs/lucene410/package-summary.html#package_description, but it seems to have less information in it than the older one.

In a nutshell, when Lucene indexes a document, it breaks it down into a number of terms. It then stores the terms in an index file where each term is associated with the documents that contain it. You could think of it as a bit like a hashtable.
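
To make the term-to-documents idea concrete, here is a minimal in-memory sketch in Python. It is not Lucene's file format or API: `build_inverted_index` is a hypothetical helper, terms are produced by plain lowercasing and whitespace splitting, and an ordinary dict stands in for the index file.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {
    1: "Lucene builds an index",
    2: "an index maps terms to documents",
}
index = build_inverted_index(docs)
```

Looking up `index["index"]` then returns both document ids, which is the hashtable-like behaviour described above.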

Terms are generated by an analyzer, which splits the text into tokens and can also normalize them, for example by stemming each word to its root form. The most popular stemming algorithm for the English language is the Porter stemming algorithm: http://tartarus.org/~martin/PorterStemmer/
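
As a toy illustration of what an analyzer does, the Python sketch below lowercases the text, splits it into tokens and strips a few common suffixes. This is explicitly not the Porter algorithm and not a Lucene class: `analyze` and the `SUFFIXES` list are made up for the example and are far cruder than a real stemmer.

```python
import re

# Hypothetical suffix list, checked in order; much cruder than Porter.
SUFFIXES = ("ing", "ed", "es", "s")


def analyze(text):
    """Lowercase, tokenize, and strip one suffix from each token."""
    terms = []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        for suffix in SUFFIXES:
            # Keep at least three characters so short words survive.
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        terms.append(token)
    return terms


terms = analyze("Indexing indexed indexes")
```

All three words reduce to the same term, `index`, so a document containing any one of them can be found by a query containing another.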

When a query is issued, it is processed by the same analyzer that was used to build the index, and the resulting terms are then looked up in the index. That provides a list of documents that match the query.
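
The query path can be sketched the same way. In the Python example below, `analyze`, `build_index` and `search` are hypothetical helpers, the analyzer only lowercases and tokenizes, and requiring every query term to match is an assumption made for simplicity rather than a description of how Lucene combines terms.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase and split on non-alphanumerics (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build a term -> set of document ids mapping using analyze()."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            postings[term].add(doc_id)
    return postings


def search(index, query):
    """Return sorted ids of documents containing every analyzed query term."""
    terms = analyze(query)  # same analyzer as at index time
    if not terms:
        return []
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)


docs = {
    1: "Lucene indexes documents",
    2: "Documents contain terms",
    3: "Lucene merges segments",
}
index = build_index(docs)
hits = search(index, "LUCENE documents")
```

Because the query goes through the same `analyze` function as the documents, the upper-case `LUCENE` still matches the lower-cased term stored in the index.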
