Solr/Lucene fuzzy search too slow


Problem description

I am trying to implement fuzzy search over locations (cities, regions, countries, objects) using a Solr server. Currently, my index contains about 0.8-1.0 M items. Fuzzy search (~0.7) works really well but is too slow for me (very often 0.2-0.6 sec). The tokenizer in use is <tokenizer class="solr.StandardTokenizerFactory"/>. As an alternative I tried <tokenizer class="solr.WhitespaceTokenizerFactory"/>. It is great in terms of performance (about 100x faster), but it does not offer fuzzy search :(

Do you know of a different approach I could use? I would like to keep the fuzzy search feature but make it much faster, if possible.

Thanks a lot!

Solution

Your problem is not related to the analyzer that you use. When you search for Califrna~0.7, Lucene iterates over all terms in the index and calculates the (Levenshtein) edit distance between "Califrna" and each term. This is a very expensive operation.
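To make the cost concrete, here is a minimal Python sketch of such a brute-force scan. It is only an illustration, not Lucene's actual code: the function names and the sample term list are hypothetical, and normalising the distance by the length of the shorter string is an assumption about how the ~0.7 threshold is applied.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute (or match)
        prev = cur
    return prev[-1]


def similarity(query: str, term: str) -> float:
    # Assumption: the distance is normalised by the shorter string's length.
    shortest = min(len(query), len(term))
    if shortest == 0:
        return 0.0
    return 1.0 - levenshtein(query, term) / shortest


def brute_force_fuzzy(query, terms, min_similarity=0.7):
    """Score the query against EVERY term: one distance computation per term."""
    matches = []
    for term in terms:
        score = similarity(query, term)
        if score >= min_similarity:
            matches.append((term, score))
    return sorted(matches, key=lambda m: m[1], reverse=True)


# Hypothetical five-term "index"; a real index has hundreds of thousands of terms.
terms = ["california", "carolina", "colorado", "calgary", "catalonia"]
print(brute_force_fuzzy("califrna", terms))
```

Each query pays for one full distance computation per indexed term, so the work grows linearly with the number of distinct terms in the index.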

This issue will be solved in Lucene version 4.0. Unfortunately, the Lucene version that comes with Solr still uses the old brute-force approach.

https://issues.apache.org/jira/browse/LUCENE-2089

http://java.dzone.com/news/lucenes-fuzzyquery-100-times

If it is OK for you, I would suggest downloading Solr/Lucene from trunk and testing how the new fuzzy query works.

http://wiki.apache.org/solr/NightlyBuilds

Even though trunk is stable, it is not recommended for production use. I can suggest two similar methods:

1 - SpellChecker

http://wiki.apache.org/solr/SpellCheckComponent

http://www.lucidimagination.com/blog/2010/08/31/getting-started-spell-checking-with-apache-lucene-and-solr/

SpellChecker builds its own small index of n-grams in order to perform fast lookups. It also uses Levenshtein distance, but instead of iterating over all terms it only calculates the distance for related terms.

You first need to run the spell checker on "Califrna", and it will suggest "California". You can then use "California" in the query against your main index, without a fuzzy query.
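The n-gram idea described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the real SpellChecker implementation: the function names, the fixed bigram size, and the min_shared and max_distance thresholds are all hypothetical.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def ngrams(word: str, n: int = 2) -> set:
    """All distinct character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def build_ngram_index(terms, n: int = 2):
    """Map each n-gram to the set of terms that contain it."""
    index = defaultdict(set)
    for term in terms:
        for gram in ngrams(term, n):
            index[gram].add(term)
    return index


def suggest(word, index, n=2, min_shared=2, max_distance=2):
    """Collect terms sharing n-grams with the word, then rank only those."""
    shared = defaultdict(int)
    for gram in ngrams(word, n):
        for term in index.get(gram, ()):
            shared[term] += 1
    # Only terms that share enough n-grams are considered "related".
    candidates = [t for t, count in shared.items() if count >= min_shared]
    scored = sorted((levenshtein(word, t), t) for t in candidates)
    return [t for dist, t in scored if dist <= max_distance]


# Hypothetical term list standing in for the main index's vocabulary.
terms = ["california", "carolina", "colorado", "calgary", "catalonia"]
index = build_ngram_index(terms)
print(suggest("califrna", index))
```

In this sketch "colorado" shares no bigram with "califrna", so its edit distance is never computed. That pruning is where the speed-up over the full scan comes from.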

2 - Auto Suggest

http://wiki.apache.org/solr/Suggester

With the Suggester component you can offer the correct spelling while the user types the query. This will be a lot faster. It supports fuzzy search with the JaspellLookup class. JaspellLookup needs to be updated in order to enable fuzzy search, though the wiki does not say much about what needs to be updated. If usePrefix is set to false, I guess it should perform a fuzzy lookup.
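The type-ahead part of this idea can be sketched as a prefix lookup over a sorted term list. This is a generic Python illustration only: it does not reproduce the Solr Suggester or JaspellLookup, it does no fuzzy matching, and the function names and sample terms are hypothetical.

```python
import bisect


def build_suggester(terms):
    """Lower-case, de-duplicate and sort the terms once, up front."""
    return sorted({t.lower() for t in terms})


def suggest_prefix(sorted_terms, prefix, limit=5):
    """Return up to `limit` terms that start with `prefix`."""
    prefix = prefix.lower()
    # Binary search to the first term >= prefix; matches are contiguous from there.
    i = bisect.bisect_left(sorted_terms, prefix)
    results = []
    while (i < len(sorted_terms)
           and len(results) < limit
           and sorted_terms[i].startswith(prefix)):
        results.append(sorted_terms[i])
        i += 1
    return results


# Hypothetical location names.
suggester = build_suggester(["California", "Calgary", "Carolina", "Colorado", "Catalonia"])
print(suggest_prefix(suggester, "Cal"))
```

Because the user picks a correctly spelled term before the query is sent, the main index can then be searched with an exact term instead of a fuzzy one.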
