Fuzzy search with Lucene
Question
I implemented a fuzzy search with Lucene 4.3.1, but I'm not satisfied with the results. I would like to specify the number of results it should return. For example, if I want 10 results, it should return the 10 best matches, no matter how bad they are. Most of the time it returns nothing if the word I search for is very different from anything in the index. How can I get more, or fuzzier, results?
Here is my code:
public String[] luceneQuery(String query, int numberOfHits, String path)
        throws ParseException, IOException {
    File dir = new File(path);
    Directory index = FSDirectory.open(dir);
    query = query + "~";
    Query q = new QueryParser(Version.LUCENE_43, "label", analyzer)
            .parse(query);
    IndexReader reader = DirectoryReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    Query fuzzyQuery = new FuzzyQuery(new Term("label", query), 2);
    ScoreDoc[] fuzzyHits = searcher.search(fuzzyQuery, numberOfHits).scoreDocs;
    String[] fuzzyResults = new String[fuzzyHits.length];
    for (int i = 0; i < fuzzyHits.length; ++i) {
        int docId = fuzzyHits[i].doc;
        Document d = searcher.doc(docId);
        fuzzyResults[i] = d.get("label");
    }
    reader.close();
    return fuzzyResults;
}
Recommended answer
Large edit distances are no longer supported by FuzzyQuery in Lucene 4.x. The current implementation of FuzzyQuery is a huge performance improvement over the Lucene 3.x implementation, but it supports at most two edits. Distances greater than 2 Damerau-Levenshtein edits are considered to be rarely useful.
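To make the two-edit limit concrete, here is a minimal plain-Java sketch of a Damerau-Levenshtein distance, where an insertion, deletion, substitution, or swap of two adjacent characters each counts as one edit. It uses the restricted "optimal string alignment" variant and a simple dynamic-programming table; this is an assumption chosen for illustration, not Lucene's actual automaton-based implementation, and the class name EditDistance is hypothetical.

```java
public class EditDistance {

    // Restricted Damerau-Levenshtein (optimal string alignment) distance.
    // Insert, delete, substitute, and adjacent transposition each cost 1.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i; // delete all i characters of a
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j; // insert all j characters of b
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
                // Adjacent transposition, e.g. "en" <-> "ne".
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // One substitution and one transposition: both within the limit of 2.
        System.out.println(distance("lucene", "lucine"));
        System.out.println(distance("lucene", "lucnee"));
        // A very different word lies well beyond 2 edits, so a fuzzy query
        // capped at 2 edits would not match it at all.
        System.out.println(distance("lucene", "solr"));
    }
}
```

Any indexed term whose distance from the search term exceeds 2 is simply not a candidate, which is consistent with the empty result sets described in the question.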
According to the FuzzyQuery documentation, if you really must have higher edit distances:
If you really want this, consider using an n-gram indexing technique (such as the SpellChecker in the suggest module) instead.
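As a rough illustration of the n-gram idea, the sketch below ranks labels by the overlap of their character trigrams with the query and always returns the requested number of best matches, however weak they are. It is an in-memory, plain-Java stand-in and not Lucene's SpellChecker; the class name NGramRanker, the trigram size, the underscore padding, and the Jaccard score are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NGramRanker {

    // Character n-grams of a lowercased string, padded with '_' so that
    // the start and end of the word contribute their own n-grams.
    static Set<String> ngrams(String s, int n) {
        StringBuilder pad = new StringBuilder();
        for (int i = 0; i < n - 1; i++) {
            pad.append('_');
        }
        String padded = pad + s.toLowerCase() + pad;
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= padded.length(); i++) {
            grams.add(padded.substring(i, i + n));
        }
        return grams;
    }

    // Jaccard similarity of the two trigram sets, between 0.0 and 1.0.
    static double similarity(String a, String b) {
        Set<String> ga = ngrams(a, 3);
        Set<String> gb = ngrams(b, 3);
        Set<String> common = new HashSet<>(ga);
        common.retainAll(gb);
        int union = ga.size() + gb.size() - common.size();
        return union == 0 ? 0.0 : (double) common.size() / union;
    }

    // The `limit` most similar labels, best first. No score threshold is
    // applied, so poor matches are still returned.
    static List<String> topMatches(String query, List<String> labels, int limit) {
        List<String> ranked = new ArrayList<>(labels);
        ranked.sort(Comparator
                .comparingDouble((String label) -> similarity(query, label))
                .reversed());
        return ranked.subList(0, Math.min(limit, ranked.size()));
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList("lucene", "lucerne", "solr", "elasticsearch");
        System.out.println(topMatches("lucine", labels, 2));
    }
}
```

Scoring every label in memory only works for small collections. For a real index, the same idea would be applied at indexing time, for example by indexing the n-grams of each label, so that ordinary scored queries can rank partial overlaps.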
The strong implication is that you should rethink what you're trying to accomplish and find a more useful approach.