Finding the position of search hits from Lucene
Question
With Lucene, what would be the recommended approach for locating matches in search results?
More specifically, suppose the index documents have a field "fullText" which stores the plain-text content of some document. Furthermore, assume that for one of these documents the content is "The quick brown fox jumps over the lazy dog". Next, a search is performed for "fox dog". Obviously, the document would be a hit.
In this scenario, can Lucene be used to provide something like the matching regions for a found document? So for this scenario I would like to produce something like:
[{match: "fox", startIndex: 16, length: 3},
 {match: "dog", startIndex: 40, length: 3}]
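For reference, the character offsets in that desired output can be checked directly against the example sentence. A minimal sketch in plain Java (no Lucene involved; the output formatting is purely illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class MatchRegions {
    // For each query term, report its start index and length in the text.
    // This is a plain string search for illustration only — it is not how
    // Lucene locates terms internally.
    static List<String> regions(String text, String... terms) {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            int start = text.indexOf(term);
            if (start >= 0) {
                out.add("{match: \"" + term + "\", startIndex: " + start
                        + ", length: " + term.length() + "}");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String doc = "The quick brown fox jumps over the lazy dog";
        for (String r : regions(doc, "fox", "dog")) {
            System.out.println(r);
        }
        // prints:
        // {match: "fox", startIndex: 16, length: 3}
        // {match: "dog", startIndex: 40, length: 3}
    }
}
```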
I suspect that it could be implemented by what's provided in the org.apache.lucene.search.highlight package. I'm not sure about the overall approach though...
Answer
TermFreqVector is what I used. Note that this only works if the field was indexed with term vectors that store positions and offsets (in Lucene 3.x, Field.TermVector.WITH_POSITIONS_OFFSETS). Here is a working demo that prints both the term positions and the starting and ending offsets of each term:
import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.index.TermVectorOffsetInfo;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Search {

    public static void main(String[] args) throws IOException, ParseException {
        Search s = new Search();
        s.doSearch(args[0], args[1]);
    }

    Search() {
    }

    public void doSearch(String db, String querystr) throws IOException, ParseException {
        // 1. Specify the analyzer for tokenizing text.
        //    The same analyzer should be used as was used for indexing.
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
        Directory index = FSDirectory.open(new File(db));

        // 2. Query
        Query q = new QueryParser(Version.LUCENE_CURRENT, "contents", analyzer).parse(querystr);

        // 3. Search
        int hitsPerPage = 10;
        IndexSearcher searcher = new IndexSearcher(index, true);
        IndexReader reader = IndexReader.open(index, true);
        searcher.setDefaultFieldSortScoring(true, false);
        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;

        // 4. Display term positions and term offsets
        System.out.println("Found " + hits.length + " hits.");
        for (int i = 0; i < hits.length; i++) {
            int docId = hits[i].doc;
            // Returns null unless the field was indexed with term vectors
            // (Field.TermVector.WITH_POSITIONS_OFFSETS).
            TermFreqVector tfvector = reader.getTermFreqVector(docId, "contents");
            TermPositionVector tpvector = (TermPositionVector) tfvector;
            // This part works only if there is one term in the query string;
            // otherwise you will have to iterate this section over the query terms.
            int termidx = tfvector.indexOf(querystr);
            int[] termposx = tpvector.getTermPositions(termidx);
            TermVectorOffsetInfo[] tvoffsetinfo = tpvector.getOffsets(termidx);

            for (int j = 0; j < termposx.length; j++) {
                System.out.println("termpos : " + termposx[j]);
            }
            for (int j = 0; j < tvoffsetinfo.length; j++) {
                int offsetStart = tvoffsetinfo[j].getStartOffset();
                int offsetEnd = tvoffsetinfo[j].getEndOffset();
                System.out.println("offsets : " + offsetStart + " " + offsetEnd);
            }

            // Print some info about where the hit was found...
            Document d = searcher.doc(docId);
            System.out.println((i + 1) + ". " + d.get("path"));
        }

        // The searcher can only be closed when there is
        // no need to access the documents any more.
        searcher.close();
    }
}
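As the comment in the demo notes, the lookup handles only a single query term; for a query like "fox dog" the same indexOf/getOffsets lookup is simply repeated per term. A minimal sketch of that loop in plain Java, where a Map stands in as a hypothetical substitute for the TermPositionVector lookup (running Lucene itself is out of scope here):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MultiTermLookup {
    // Sketch of iterating the offset lookup over each query term.
    // The Map stands in for the term vector: term -> {startOffset, endOffset}.
    static String report(Map<String, int[]> termVector, String querystr) {
        StringBuilder sb = new StringBuilder();
        for (String term : querystr.split("\\s+")) {
            // Analogous to tfvector.indexOf(term) followed by getOffsets(termidx)
            int[] offsets = termVector.get(term);
            if (offsets != null) {
                sb.append(term).append(" offsets : ")
                  .append(offsets[0]).append(" ").append(offsets[1]).append("\n");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical term vector for "The quick brown fox jumps over the lazy dog"
        Map<String, int[]> tv = new LinkedHashMap<>();
        tv.put("fox", new int[]{16, 19});
        tv.put("dog", new int[]{40, 43});
        System.out.print(report(tv, "fox dog"));
        // prints:
        // fox offsets : 16 19
        // dog offsets : 40 43
    }
}
```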