How to do query auto-completion/suggestions in Lucene?

Question

I'm looking for a way to do query auto-completion/suggestions in Lucene. I've Googled around and experimented a bit, but all of the examples I've seen set up filters in Solr. We don't use Solr and aren't planning to move to it in the near future, and since Solr is a layer on top of Lucene anyway, I imagine there must be a way to do this in Lucene directly!

I've looked into using EdgeNGramFilter, and I realise that I'd have to run the filter on the index fields, get the tokens out, and then compare them against the inputted query... I'm just struggling to turn the connection between the two into a bit of code, so help is much appreciated!

To be clear about what I'm looking for (I realise I wasn't being overly clear, sorry): I'm looking for a solution that returns a list of suggested queries when searching for a term. Typing 'inter' into the search field should come back with a list of suggested queries, such as 'internet', 'international', etc.

Recommended answer

Based on @Alexandre Victoor's answer, I wrote a little class based on the Lucene Spellchecker in the contrib package (and using the LuceneDictionary included in it) that does exactly what I want.

This allows re-indexing from a single source index with a single field, and provides suggestions for terms. Results are sorted by the number of documents in the original index that contain the term, so more popular terms appear first. Seems to work pretty well :) Note that the class was written against Lucene 2.3.2, so some of the calls it uses may differ in later versions.
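As an illustration only (this is not part of the class in this answer and uses no Lucene API), here is a plain-Java sketch of the kind of tokens a front edge n-gram filter with a minimum size of 1 and a maximum of 20 is expected to produce for a single word. The names EdgeNGramSketch and frontGrams are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Library-free illustration of front edge n-grams; not Lucene code. */
final class EdgeNGramSketch {

    /**
     * Returns the prefixes of {@code word} whose lengths run from
     * {@code minGram} up to {@code maxGram}, capped at the word length.
     */
    static List<String> frontGrams(String word, int minGram, int maxGram) {
        List<String> grams = new ArrayList<String>();
        int longest = Math.min(maxGram, word.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(word.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(frontGrams("internet", 1, 20));
        // prints [i, in, int, inte, inter, intern, interne, internet]
    }
}
```

Indexing each of these prefixes alongside the original word is what lets an exact-match lookup on 'inter' find 'internet'.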

import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.ISOLatin1AccentFilter;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter.Side;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/**
 * Search term auto-completer, works for single terms (so use on the last term
 * of the query).
 * <p>
 * Returns more popular terms first.
 * 
 * @author Mat Mannion, M.Mannion@warwick.ac.uk
 */
public final class Autocompleter {

    private static final String GRAMMED_WORDS_FIELD = "words";

    private static final String SOURCE_WORD_FIELD = "sourceWord";

    private static final String COUNT_FIELD = "count";

    private static final String[] ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by",
    "for", "i", "if", "in", "into", "is",
    "no", "not", "of", "on", "or", "s", "such",
    "t", "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with"
    };

    private final Directory autoCompleteDirectory;

    private IndexReader autoCompleteReader;

    private IndexSearcher autoCompleteSearcher;

    public Autocompleter(String autoCompleteDir) throws IOException {
    	this.autoCompleteDirectory = FSDirectory.getDirectory(autoCompleteDir,
    			null);

    	reOpenReader();
    }

    public List<String> suggestTermsFor(String term) throws IOException {
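    	// get the top 5 terms for query; the grams are lower-cased at index
    	// time and a TermQuery is not analyzed, so lower-case the input too
    	Query query = new TermQuery(new Term(GRAMMED_WORDS_FIELD,
    			term.toLowerCase()));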
    	Sort sort = new Sort(COUNT_FIELD, true);

    	TopDocs docs = autoCompleteSearcher.search(query, null, 5, sort);
    	List<String> suggestions = new ArrayList<String>();
    	for (ScoreDoc doc : docs.scoreDocs) {
    		suggestions.add(autoCompleteReader.document(doc.doc).get(
    				SOURCE_WORD_FIELD));
    	}

    	return suggestions;
    }

    @SuppressWarnings("unchecked")
    public void reIndex(Directory sourceDirectory, String fieldToAutocomplete)
    		throws CorruptIndexException, IOException {
    	// build a dictionary (from the spell package)
    	IndexReader sourceReader = IndexReader.open(sourceDirectory);

    	LuceneDictionary dict = new LuceneDictionary(sourceReader,
    			fieldToAutocomplete);

    	// code from
    	// org.apache.lucene.search.spell.SpellChecker.indexDictionary(
    	// Dictionary)
    	IndexReader.unlock(autoCompleteDirectory);

    	// use a custom analyzer so we can do EdgeNGramFiltering
    	IndexWriter writer = new IndexWriter(autoCompleteDirectory,
    	new Analyzer() {
    		public TokenStream tokenStream(String fieldName,
    				Reader reader) {
    			TokenStream result = new StandardTokenizer(reader);

    			result = new StandardFilter(result);
    			result = new LowerCaseFilter(result);
    			result = new ISOLatin1AccentFilter(result);
    			result = new StopFilter(result,
    				ENGLISH_STOP_WORDS);
    			result = new EdgeNGramTokenFilter(
    				result, Side.FRONT,1, 20);

    			return result;
    		}
    	}, true);

    	writer.setMergeFactor(300);
    	writer.setMaxBufferedDocs(150);

    	// go through every word, storing the original word (incl. n-grams) 
    	// and the number of times it occurs
    	Map<String, Integer> wordsMap = new HashMap<String, Integer>();

    	Iterator<String> iter = (Iterator<String>) dict.getWordsIterator();
    	while (iter.hasNext()) {
    		String word = iter.next();

    		int len = word.length();
    		if (len < 3) {
    			continue; // too short we bail but "too long" is fine...
    		}

    		if (wordsMap.containsKey(word)) {
    			throw new IllegalStateException(
    					"This should never happen in Lucene 2.3.2");
    			// wordsMap.put(word, wordsMap.get(word) + 1);
    		} else {
    			// use the number of documents this word appears in
    			wordsMap.put(word, sourceReader.docFreq(new Term(
    					fieldToAutocomplete, word)));
    		}
    	}

    	for (String word : wordsMap.keySet()) {
    		// ok index the word
    		Document doc = new Document();
    		doc.add(new Field(SOURCE_WORD_FIELD, word, Field.Store.YES,
    				Field.Index.UN_TOKENIZED)); // orig term
    		doc.add(new Field(GRAMMED_WORDS_FIELD, word, Field.Store.YES,
    				Field.Index.TOKENIZED)); // grammed
    		doc.add(new Field(COUNT_FIELD,
    				Integer.toString(wordsMap.get(word)), Field.Store.NO,
    				Field.Index.UN_TOKENIZED)); // count

    		writer.addDocument(doc);
    	}

    	sourceReader.close();

    	// close writer
    	writer.optimize();
    	writer.close();

    	// re-open our reader
    	reOpenReader();
    }

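    private void reOpenReader() throws CorruptIndexException, IOException {
    	if (autoCompleteReader == null) {
    		autoCompleteReader = IndexReader.open(autoCompleteDirectory);
    	} else {
    		// reopen() returns a new reader when the index has changed, so
    		// keep the returned instance and close the stale one
    		IndexReader reopened = autoCompleteReader.reopen();
    		if (reopened != autoCompleteReader) {
    			autoCompleteReader.close();
    			autoCompleteReader = reopened;
    		}
    	}

    	autoCompleteSearcher = new IndexSearcher(autoCompleteReader);
    }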

    public static void main(String[] args) throws Exception {
    	Autocompleter autocomplete = new Autocompleter("/index/autocomplete");

    	// run this to re-index from the current index, shouldn't need to do
    	// this very often
    	// autocomplete.reIndex(FSDirectory.getDirectory("/index/live", null),
    	// "content");

    	String term = "steve";

    	System.out.println(autocomplete.suggestTermsFor(term));
    	// prints [steve, steven, stevens, stevenson, stevenage]
    }

}
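
The two steps the class performs (store every prefix of each term, then rank the matches by document frequency) can also be sketched without Lucene. The following is a minimal in-memory analogue under stated assumptions, not the class above: the name InMemoryAutocompleteSketch, its methods, and the sample frequencies are all hypothetical, and a plain HashMap stands in for the auto-complete index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** In-memory analogue of the prefix index; illustrative only, not Lucene code. */
final class InMemoryAutocompleteSketch {

    // prefix -> words that start with it (stands in for the grammed field)
    private final Map<String, List<String>> wordsByPrefix =
            new HashMap<String, List<String>>();
    // word -> number of documents containing it (stands in for the count field)
    private final Map<String, Integer> docFreqByWord =
            new HashMap<String, Integer>();

    /** Registers a word and its document frequency under each of its prefixes. */
    void add(String word, int docFreq) {
        String normalized = word.toLowerCase(Locale.ROOT);
        // skip very short words (as the answer does) and words already added
        if (normalized.length() < 3 || docFreqByWord.containsKey(normalized)) {
            return;
        }
        docFreqByWord.put(normalized, docFreq);
        int longest = Math.min(20, normalized.length());
        for (int len = 1; len <= longest; len++) {
            String prefix = normalized.substring(0, len);
            List<String> bucket = wordsByPrefix.get(prefix);
            if (bucket == null) {
                bucket = new ArrayList<String>();
                wordsByPrefix.put(prefix, bucket);
            }
            bucket.add(normalized);
        }
    }

    /** Returns up to {@code limit} words for the typed prefix, most frequent first. */
    List<String> suggest(String typed, int limit) {
        List<String> bucket = wordsByPrefix.get(typed.toLowerCase(Locale.ROOT));
        if (bucket == null) {
            return new ArrayList<String>();
        }
        List<String> ranked = new ArrayList<String>(bucket);
        Collections.sort(ranked, new Comparator<String>() {
            public int compare(String a, String b) {
                // higher document frequency first, then alphabetical
                int byFreq = docFreqByWord.get(b).compareTo(docFreqByWord.get(a));
                return byFreq != 0 ? byFreq : a.compareTo(b);
            }
        });
        if (ranked.size() > limit) {
            return new ArrayList<String>(ranked.subList(0, limit));
        }
        return ranked;
    }

    public static void main(String[] args) {
        InMemoryAutocompleteSketch sketch = new InMemoryAutocompleteSketch();
        sketch.add("internet", 40);
        sketch.add("international", 25);
        sketch.add("interval", 3);
        sketch.add("index", 90);
        System.out.println(sketch.suggest("inter", 5));
        // prints [internet, international, interval]
    }
}
```

Unlike this sketch, the Lucene version keeps the prefix index on disk and delegates the ranking to a sort on the count field, which is what makes it practical for a large vocabulary.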
