Lucene: Multi-word phrases as search terms


Question

I'm trying to make a searchable phone/local business directory using Apache Lucene.

I have fields for street name, business name, phone number, etc. The problem I'm having is that when I search by street and the street name has multiple words (e.g. 'the crescent'), no results are returned. But if I search with just one word, e.g. 'crescent', I get all the results that I want.

I'm indexing the data with the following:

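import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;

String LocationOfDirectory = "C:\\dir\\index";

StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
Directory index = new SimpleFSDirectory(new File(LocationOfDirectory));

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_34, analyzer);
IndexWriter w = new IndexWriter(index, config);

Document doc = new Document();
doc.add(new Field("Street", "the crescent", Field.Store.YES, Field.Index.ANALYZED));

w.addDocument(doc);
w.close();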

My search works as follows:

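import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;

int numberOfHits = 200;
String LocationOfDirectory = "C:\\dir\\index";
TopScoreDocCollector collector = TopScoreDocCollector.create(numberOfHits, true);
Directory directory = new SimpleFSDirectory(new File(LocationOfDirectory));
IndexSearcher searcher = new IndexSearcher(IndexReader.open(directory));

WildcardQuery q = new WildcardQuery(new Term("Street", "the crescent"));

searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;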

I have tried swapping the wildcard query for a phrase query, first with the entire string and then splitting the string up on white space and wrapping them in a BooleanQuery like this:

String term = "the crescent";
BooleanQuery b = new BooleanQuery();
PhraseQuery p = new PhraseQuery();
String[] tokens = term.split(" ");
for(int i = 0 ; i < tokens.length ; ++i)
{
    p.add(new Term("Street", tokens[i]));
}
b.add(p, BooleanClause.Occur.MUST);

However, this didn't work. I tried using a KeywordAnalyzer instead of a StandardAnalyzer, but then all other types of search stopped working as well. I have tried replacing spaces with other characters (+ and @), and converting queries to and from this form, but that still doesn't work. I think it doesn't work because + and @ are special characters which are not indexed, but I can't seem to find a list anywhere of which characters are like that.

I'm beginning to go slightly mad, does anyone know what I'm doing wrong?

Thanks, Rik

Recommended answer

I found that my attempt to generate a query without using a QueryParser was not working, so I stopped trying to create my own queries and used a QueryParser instead. All of the recommendations that I saw online said that you should use the same Analyzer in the QueryParser that you use during indexing, so I used a StandardAnalyzer to build the QueryParser.

This works for this example because the StandardAnalyzer removes the stop word "the" from the street name "the crescent" during indexing, so a query containing "the" can never match: that term isn't in the index. Passing the query through the same analyzer strips "the" from the search string as well, leaving only "crescent".
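To make the analyzer point concrete, here is a toy model in plain Java. It is not Lucene code: `ToyAnalyzerDemo`, `analyze` and the tiny stop-word list are made up for illustration, and the real StandardAnalyzer tokenizes far more carefully. It only sketches why a raw term such as "the crescent" finds nothing, while the same string run through the indexing analyzer does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class ToyAnalyzerDemo {
    // Hypothetical stop-word list for illustration only; not Lucene's actual list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of"));

    // Lower-case the text, split on whitespace, and drop stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // What ends up "in the index" for the street name.
        List<String> indexed = analyze("the crescent");
        System.out.println(indexed); // [crescent]

        // A raw, unanalyzed term is compared against single tokens, so it never matches.
        System.out.println(indexed.contains("the crescent")); // false

        // Analyzing the query the same way yields terms that do exist in the index.
        System.out.println(indexed.containsAll(analyze("The Crescent"))); // true
    }
}
```

The same reasoning suggests why building a Term by hand from the user's raw input is fragile: the term has to match exactly what the analyzer produced at index time.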

However, if we choose to search for "Grove Road", we have a problem with the out-of-the-box functionality, namely that the query will return all of the results containing either "Grove" OR "Road". This is easily fixed by setting up the QueryParser so that its default operator is AND instead of OR.
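The difference between the two default operators can be sketched without Lucene. The class below is a hypothetical stand-in (`DefaultOperatorDemo`, `search` and the sample street names are invented for this illustration); it simply treats each street as a set of lower-cased words and checks whether any, or all, of the query words are present.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class DefaultOperatorDemo {
    // Split text into a set of lower-cased words (a stand-in for analysis).
    static Set<String> tokens(String text) {
        return new HashSet<>(
                Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+")));
    }

    // requireAll = true models an AND default operator, false models OR.
    static List<String> search(List<String> docs, String query, boolean requireAll) {
        Set<String> queryTerms = tokens(query);
        List<String> hits = new ArrayList<>();
        for (String doc : docs) {
            Set<String> docTerms = tokens(doc);
            boolean match;
            if (requireAll) {
                match = docTerms.containsAll(queryTerms);
            } else {
                match = false;
                for (String term : queryTerms) {
                    if (docTerms.contains(term)) {
                        match = true;
                        break;
                    }
                }
            }
            if (match) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up sample data.
        List<String> streets = Arrays.asList("Grove Road", "Grove Lane", "Station Road");

        System.out.println(search(streets, "grove road", false));
        // [Grove Road, Grove Lane, Station Road]

        System.out.println(search(streets, "grove road", true));
        // [Grove Road]
    }
}
```

Note that AND only requires every term to be present; it does not require the words to be adjacent or in order. If exact phrase matching matters, the QueryParser syntax also accepts a phrase wrapped in double quotes.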

In the end, the correct solution was the following:

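import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;

int numberOfHits = 200;
String LocationOfDirectory = "C:\\dir\\index";
TopScoreDocCollector collector = TopScoreDocCollector.create(numberOfHits, true);
Directory directory = new SimpleFSDirectory(new File(LocationOfDirectory));
IndexSearcher searcher = new IndexSearcher(IndexReader.open(directory));

// Use the same analyzer for querying as for indexing
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);

//WildcardQuery q = new WildcardQuery(new Term("Street", "the crescent"));
QueryParser qp = new QueryParser(Version.LUCENE_35, "Street", analyzer);
// Require every term to match instead of any term
qp.setDefaultOperator(QueryParser.Operator.AND);

Query q = qp.parse("grove road"); // parse() can throw ParseException

searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;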
