Solr/Lucene fuzzy search too slow

Question

I am trying to implement fuzzy search over locations (cities, regions, countries, objects) using a Solr server. Currently, my index contains about 0.8-1.0 million items. Fuzzy search (~0.7) works really well but is too slow for me (very often 0.2-0.6 seconds). The tokenizer in use is <tokenizer class="solr.StandardTokenizerFactory"/>. As an alternative I tried <tokenizer class="solr.WhitespaceTokenizerFactory"/>; it is great in terms of performance (about 100x faster) but it does not offer fuzzy search :(

Do you know of a different approach I could use? I would like to keep the benefits of fuzzy search, but make it much faster if possible.

Thanks a lot!

Recommended answer

Your problem is not related to the analyzer that you use. When you search for Califrna~0.7, Lucene iterates over all terms in the index and calculates the (Levenshtein) edit distance between "Califrna" and every one of them. This is a very expensive operation.
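To make the cost concrete, here is a minimal Python sketch of the brute-force idea described above. It is not Lucene's code: the function names and the tiny TERMS list are hypothetical, and it filters by a maximum edit count rather than by Lucene's ~0.7 similarity score. The point is only that every query has to touch every term in the dictionary.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def brute_force_fuzzy(query: str, terms: List[str], max_edits: int) -> List[str]:
    """Compare the query against EVERY term; the work grows with the dictionary size."""
    q = query.lower()
    return [t for t in terms if levenshtein(q, t.lower()) <= max_edits]


# Hypothetical toy dictionary; a real index would hold hundreds of thousands of terms.
TERMS = ["california", "carolina", "colorado", "calgary", "florida"]
print(brute_force_fuzzy("Califrna", TERMS, max_edits=2))
```

With five terms this is instant, but with roughly a million items each query runs a full edit-distance computation per distinct term, which is consistent with the latencies reported in the question.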

This issue will be solved in Lucene 4.0. Unfortunately, the Lucene version that currently ships with Solr still uses the old brute-force approach.

https://issues.apache.org/jira/browse/LUCENE-2089

http://java.dzone.com/news/lucenes-fuzzyquery-100-times

If that is acceptable for you, I would suggest downloading Solr/Lucene from trunk and testing how the new fuzzy query performs.

http://wiki.apache.org/solr/NightlyBuilds

Even though trunk is stable, it is not recommended for production use. I can suggest two alternative approaches:

1 - SpellChecker

http://wiki.apache.org/solr/SpellCheckComponent

http://www.lucidimagination.com/blog/2010/08/31/getting-started-spell-checking-with-apache-lucene-and-solr/

SpellChecker builds its own small index of n-grams in order to perform fast lookups. It also uses Levenshtein distance, but instead of iterating over all terms it only calculates the distance for related terms.

You first run the spell checker on "Califrna", and it will suggest "California". Then you can use "California" in the query against your main index, without a fuzzy query.
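The two-step idea can be sketched roughly as follows. This is a simplified illustration, not the actual Lucene SpellChecker: the helper names, the bigram size, the min_shared threshold and the toy dictionary are all assumptions made for the example. An n-gram index narrows the search to terms that share some n-grams with the misspelled word, and the edit distance is computed only for those candidates.

```python
from collections import defaultdict


def levenshtein(a, b):
    """Dynamic-programming edit distance, used only on the candidate terms."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def ngrams(word, n=2):
    """Character n-grams of the word, padded so word start and end are marked."""
    padded = "^" + word + "$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def build_ngram_index(terms, n=2):
    """Map each n-gram to the set of dictionary terms that contain it."""
    index = defaultdict(set)
    for term in terms:
        for gram in ngrams(term, n):
            index[gram].add(term)
    return index


def suggest(query, index, n=2, min_shared=2, max_edits=2):
    """Return the closest candidate term, or None if nothing is close enough."""
    q = query.lower()
    shared = defaultdict(int)
    for gram in ngrams(q, n):
        for term in index.get(gram, ()):   # only terms sharing this n-gram
            shared[term] += 1
    candidates = [t for t, count in shared.items() if count >= min_shared]
    scored = [(levenshtein(q, t), t) for t in candidates]
    scored = [pair for pair in scored if pair[0] <= max_edits]
    return min(scored)[1] if scored else None


# Hypothetical toy dictionary standing in for the terms of the main index.
DICTIONARY = ["california", "carolina", "colorado", "calgary", "florida"]
INDEX = build_ngram_index(DICTIONARY)

corrected = suggest("Califrna", INDEX)
print(corrected)  # the corrected term is then used in a plain, non-fuzzy query
```

The n-gram lookup is a handful of dictionary accesses, so the expensive distance computation runs on a few candidates instead of the whole term list. The main query then uses the corrected term as an exact match.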

2 - AutoSuggest

http://wiki.apache.org/solr/Suggester

With the Suggester component you can offer the correct spelling as the user types the query. This will be a lot faster. It supports fuzzy search through the JaspellLookup class, but JaspellLookup needs to be updated in order to enable fuzzy search, and the wiki does not say much about what needs to be updated. My guess is that if usePrefix is set to false, it should perform a fuzzy lookup.
