Solr/Lucene fuzzy search too slow
I am trying to implement fuzzy search for locations (cities, regions, countries, objects) using a Solr server. Currently, my index contains about 0.8-1.0 million items. Fuzzy search (~0.7) gives really good results but is too slow for me (queries very often take 0.2-0.6 sec). The tokenizer in use is <tokenizer class="solr.StandardTokenizerFactory"/>. As an alternative I tried <tokenizer class="solr.WhitespaceTokenizerFactory"/>: it is great in terms of performance (about 100x faster), but it does not offer fuzzy search :(
Do you know of a different approach I could use? I would like to keep the benefits of fuzzy search, but make it much faster if possible.
Thanks a lot!
Your problem is not related to the analyzer you use. When you search for Califrna~0.7, Lucene iterates over all terms in the index and calculates the (Levenshtein) edit distance between "Califrna" and every term. This is a very expensive operation.
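To make the cost concrete, here is a minimal Python sketch of the brute-force idea described above. It is not Lucene's code: the function names, the sample term list, and the fixed max_distance threshold (used in place of Lucene's 0.7 similarity score) are assumptions for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep on a match)
            ))
        previous = current
    return previous[-1]


def brute_force_fuzzy(query, terms, max_distance=2):
    """Compare the query against EVERY term; work grows with the term count."""
    matches = []
    for term in terms:
        distance = levenshtein(query, term)
        if distance <= max_distance:
            matches.append((term, distance))
    return sorted(matches, key=lambda match: (match[1], match[0]))


# Hypothetical stand-in for the index's term dictionary.
terms = ["california", "carolina", "colorado", "calgary", "florida"]
print(brute_force_fuzzy("califrna", terms))
```

With close to a million distinct terms, the loop in brute_force_fuzzy runs a full distance computation per term for every single query, which is where the time goes.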
This issue will be solved in Lucene 4.0. Unfortunately, the Lucene version that ships with Solr still uses the old brute-force approach.
https://issues.apache.org/jira/browse/LUCENE-2089
http://java.dzone.com/news/lucenes-fuzzyquery-100-times
If that is an option for you, I would suggest downloading Solr/Lucene from trunk and testing how the new fuzzy query performs.
http://wiki.apache.org/solr/NightlyBuilds
Even though trunk is stable, it is not recommended for production use. I can suggest two similar alternatives:
1 - SpellChecker
http://wiki.apache.org/solr/SpellCheckComponent
http://www.lucidimagination.com/blog/2010/08/31/getting-started-spell-checking-with-apache-lucene-and-solr/
SpellChecker builds its own small index of n-grams in order to perform fast lookups. It also uses Levenshtein distance, but instead of iterating over all terms it only calculates the distance for related terms.
You first run the spell checker for "Califrna", and it will suggest "California". Then you can use "California" in the query against your main index, with no fuzzy query needed.
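The n-gram lookup can be sketched in a few lines of Python. This is only an illustration of the general technique, not the SpellChecker implementation: the trigram size, the ^ and $ boundary markers, the helper names, and the sample terms are all assumptions.

```python
from collections import defaultdict


def ngrams(word, n=3):
    """Character n-grams of a word, with ^ and $ marking its boundaries."""
    padded = f"^{word}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def build_ngram_index(terms, n=3):
    """Map each n-gram to the set of terms that contain it."""
    index = defaultdict(set)
    for term in terms:
        for gram in ngrams(term, n):
            index[gram].add(term)
    return index


def candidate_terms(query, index, n=3):
    """Only terms sharing at least one n-gram with the query are considered."""
    candidates = set()
    for gram in ngrams(query, n):
        candidates |= index.get(gram, set())
    return candidates


def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def suggest(query, index, n=3, max_distance=2):
    """Closest candidate within max_distance, or None if there is none."""
    scored = [(edit_distance(query, c), c) for c in candidate_terms(query, index, n)]
    scored = [pair for pair in scored if pair[0] <= max_distance]
    return min(scored)[1] if scored else None


# Hypothetical term list standing in for the spell-check dictionary.
index = build_ngram_index(["california", "carolina", "colorado", "calgary", "florida"])
corrected = suggest("califrna", index) or "califrna"
print(corrected)  # use this as a plain (non-fuzzy) term in the main query
```

The edit distance is computed only for the handful of terms that share an n-gram with the query, rather than for the whole dictionary, and the main index is then searched with an exact term.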
2 - Auto Suggest
http://wiki.apache.org/solr/Suggester
With the suggester component, you can offer the correct spelling while the user types the query. This will be a lot faster. It supports fuzzy search with the JaspellLookup class, but JaspellLookup needs to be updated in order to enable it. The wiki does not say much about what needs to be updated, though. My guess is that it should perform a fuzzy lookup if usePrefix is set to false.
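As a rough picture of why suggesting while the user types is cheap, here is a small Python sketch of prefix lookup over a sorted list of names. It is a stand-in only: Solr's suggester uses its own lookup structures, and the function name, the limit parameter, and the sample names are hypothetical.

```python
import bisect


def suggest_by_prefix(prefix, sorted_terms, limit=5):
    """Return up to `limit` terms starting with `prefix` from a sorted list."""
    # Binary search jumps straight to the first term >= prefix.
    start = bisect.bisect_left(sorted_terms, prefix)
    results = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match either
        results.append(term)
        if len(results) == limit:
            break
    return results


# Hypothetical location names; the list must be sorted for bisect to work.
names = sorted(["california", "calgary", "cali", "carolina", "colorado"])
print(suggest_by_prefix("cali", names))
```

The lookup touches only the terms that actually share the typed prefix, so its cost does not depend on scanning the whole term list.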