How do I index and search text files in Lucene 3.0.2?


Problem description

I am a newbie in Lucene, and I'm having some problems creating simple code to query a text file collection.

I tried this example, but it is incompatible with the new version of Lucene.

UPDATE: Here is my new code, but it still doesn't work.

Recommended answer

Lucene is a quite big topic with a lot of classes and methods to cover, and you normally cannot use it without understanding at least some basic concepts. If you need a quickly available service, use Solr instead. If you need full control of Lucene, read on. I will cover some core Lucene concepts and the classes that represent them. (For information on how to read text files into memory, see, for example, this article.)

Whatever you are going to do in Lucene - indexing or searching - you need an analyzer. The goal of an analyzer is to tokenize (break into words) and stem (get the base of a word) your input text. It also throws out the most frequent words like "a", "the", etc. You can find analyzers for more than 20 languages, or you can use SnowballAnalyzer and pass the language as a parameter.
To create an instance of SnowballAnalyzer for English, do this:

Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");

If you are going to index texts in different languages and want to select the analyzer automatically, you can use tika's LanguageIdentifier.
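A minimal sketch of that idea, assuming Tika's `LanguageIdentifier` is on the classpath; the class name `AnalyzerSelector` and the language map below are illustrative, not part of the original answer, and cover only a few of the languages Snowball supports:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.util.Version;
import org.apache.tika.language.LanguageIdentifier;

public class AnalyzerSelector {
    // Maps ISO 639-1 codes (as returned by LanguageIdentifier)
    // to Snowball stemmer names. Illustrative, not exhaustive.
    private static final Map<String, String> SNOWBALL_NAMES = new HashMap<String, String>();
    static {
        SNOWBALL_NAMES.put("en", "English");
        SNOWBALL_NAMES.put("de", "German");
        SNOWBALL_NAMES.put("fr", "French");
        SNOWBALL_NAMES.put("ru", "Russian");
    }

    public static Analyzer forText(String text) {
        String code = new LanguageIdentifier(text).getLanguage(); // e.g. "en"
        String name = SNOWBALL_NAMES.get(code);
        if (name == null) {
            name = "English"; // fall back for unmapped languages
        }
        return new SnowballAnalyzer(Version.LUCENE_30, name);
    }
}
```

Note that the analyzer chosen here must also be the one you later pass to the QueryParser when searching that text.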

You need to store your index somewhere. There are 2 major possibilities for this: an in-memory index, which is easy to try, and a disk index, which is the most widespread one.
Use either of the next 2 lines:

Directory directory = new RAMDirectory();   // RAM index storage
Directory directory = FSDirectory.open(new File("/path/to/index"));  // disk index storage

When you want to add, update or delete a document, you need an IndexWriter:

IndexWriter writer = new IndexWriter(directory, analyzer, true, new IndexWriter.MaxFieldLength(25000));

Any document (a text file in your case) is a set of fields. To create a document which will hold information about your file, use this:

Document doc = new Document();
String title = nameOfYourFile;
doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));  // adding title field
String content = contentsOfYourFile;
doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED)); // adding content field
writer.addDocument(doc);  // writing new document to the index

The Field constructor takes the field's name, its text, and at least 2 more parameters. The first is a flag that shows whether Lucene must store this field. If it equals Field.Store.YES, you will have the possibility to get all your text back from the index; otherwise only index information about it will be stored.
The second parameter shows whether Lucene must index this field or not. Use Field.Index.ANALYZED for any field you are going to search on.
Normally, you use both parameters as shown above.
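Other combinations are useful too. As a sketch (these fields are hypothetical additions, not from the original answer): a file path you want back verbatim but matched only as a single exact token, and a large body you want to search but not retrieve; `veryLargeText` is a placeholder variable:

```java
// Stored, but indexed as one exact term (no tokenizing/stemming):
doc.add(new Field("path", "/path/to/file.txt",
        Field.Store.YES, Field.Index.NOT_ANALYZED));
// Searchable, but not stored (saves index space for big texts):
doc.add(new Field("body", veryLargeText,
        Field.Store.NO, Field.Index.ANALYZED));
```

With Field.Store.NO, hitDoc.get("body") at search time will return null, even though the field still matches queries.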

Don't forget to close your IndexWriter after the job is done:

writer.close();

Searching is a bit tricky. You will need several classes: Query and QueryParser to make a Lucene query from a string, IndexSearcher for the actual searching, TopScoreDocCollector to store results (it is passed to IndexSearcher as a parameter) and ScoreDoc to iterate through the results. The next snippet shows how this is all composed:

IndexSearcher searcher = new IndexSearcher(directory);
QueryParser parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
Query query = parser.parse("terms to search");
TopScoreDocCollector collector = TopScoreDocCollector.create(HOW_MANY_RESULTS_TO_COLLECT, true);
searcher.search(query, collector);

ScoreDoc[] hits = collector.topDocs().scoreDocs;
// `i` is just a number of document in Lucene. Note, that this number may change after document deletion 
for (int i = 0; i < hits.length; i++) {
    Document hitDoc = searcher.doc(hits[i].doc);  // getting actual document
    System.out.println("Title: " + hitDoc.get("title"));
    System.out.println("Content: " + hitDoc.get("content"));
    System.out.println();
}

Note the second argument to the QueryParser constructor - it is the default field, i.e. the field that will be searched if no qualifier was given. For example, if your query is "title:term", Lucene will search for the word "term" in the field "title" of all docs, but if your query is just "term" it will search in the default field, in this case "content". For more info see Lucene Query Syntax.
QueryParser also takes an analyzer as its last argument. This must be the same analyzer you used to index your text.

The last thing you must know about is the first parameter of TopScoreDocCollector.create. It is just a number that represents how many results you want to collect. For example, if it equals 100, Lucene will collect only the first (by score) 100 results and drop the rest. This is just an optimization - you collect the best results, and if you're not satisfied with them, you repeat the search with a larger number.

Finally, don't forget to close the searcher and directory so as not to lose system resources:

searcher.close();
directory.close();

Edit: Also see the IndexFiles demo class from the Lucene 3.0 sources.
