Sphinx and "did you mean ... ?" suggestions idea. Will it work?

Question

I'm trying to come up with the fastest way to make search suggestions. At first I thought a Levenshtein UDF combined with a MySQL table would do the job. But with Levenshtein, MySQL would have to go over every row in the table (tons of words), which would make the query really slow.

I recently installed and started to use Sphinx (http://sphinxsearch.com/) for full-text searching, mainly because of its performance and its tight MySQL integration through SphinxSE.

So I asked myself whether I could implement a "did you mean" algorithm using Sphinx to boost performance, and I think I found a simple one. Basically, I take all the keywords I want to correct, put a space between each letter, and put the result in the Sphinx index. If the word is 'keyword', it becomes 'k e y w o r d'. When the user enters a word, I split it into letters and search the Sphinx index for a record (I just need one) that matches any of the letters provided. The best part is that Sphinx is very good at calculating the relevance (weight) of the matched rows, so the best match will always have the biggest weight (I think). It also accounts for word positions (letter positions, in my case), so the best match will have its letters in that order.
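The letter-splitting step described above can be sketched in a few lines of Python. This is only an illustration: the function names `spaced_form` and `any_letter_query` are made up here, and joining the letters with `|` assumes the query is run in Sphinx's extended query syntax, where `|` is the OR operator. The question does not say which match mode is actually used.

```python
def spaced_form(word: str) -> str:
    """Turn 'keyword' into 'k e y w o r d' so each letter is indexed as its own token."""
    letters = [ch for ch in word.lower() if not ch.isspace()]
    return " ".join(letters)


def any_letter_query(word: str) -> str:
    """Build a query that matches records containing any of the word's letters.

    Assumption: the query is executed in extended query syntax, where '|' means OR.
    """
    return " | ".join(spaced_form(word).split())


if __name__ == "__main__":
    # What would be stored in the index for a keyword:
    print(spaced_form("keyword"))
    # What would be sent as the query for a (misspelled) user input:
    print(any_letter_query("keywrod"))
```

The same `spaced_form` helper would be applied both when building the index and when preparing the user's input, so both sides are tokenized identically.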

With the Sphinx query I get the most similar word in my keyword list. Then I check it in PHP using the extended Levenshtein distance, which also accounts for transposed letters: http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance . If the string distance is lower than 2 (and != 0), I suggest the word. Otherwise I don't suggest anything.
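A minimal Python sketch of that check follows (the question does it in PHP, whose code is not shown). It uses the optimal string alignment variant of Damerau-Levenshtein, which counts an adjacent transposition as one edit; whether the asker uses this variant or the unrestricted one is not stated. The names `osa_distance` and `suggest` are hypothetical.

```python
from typing import Optional


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insert, delete, substitute, adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ro" <-> "or"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def suggest(user_word: str, candidate: str, max_distance: int = 2) -> Optional[str]:
    """Return the candidate only if 0 < distance < max_distance, as in the question."""
    dist = osa_distance(user_word.lower(), candidate.lower())
    return candidate if 0 < dist < max_distance else None
```

With the threshold from the question, only candidates exactly one edit away are suggested: `suggest("keywrod", "keyword")` returns the candidate, while an exact match or a distant word returns `None`.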

Is there a problem with my idea? Something I didn't think of? Are there any expected glitches with the Sphinx query, or quirks in the Sphinx relevance calculation that would keep it from returning the best match? Please correct me if I'm mistaken somewhere.

Recommended answer

I can't see a problem with your idea. Go for it. Just note that your method is only worthwhile if you want to override the built-in behaviour, which is already very similar to Levenshtein distance (LD).

For example, with Sphinx 1.10-beta, you can specify min_infix_len and expand_keywords and use Sphinx's built-in weighting methods (BM25 and some proprietary code) for good results. http://sphinxsearch.com/blog/2010/08/17/how-sphinx-relevance-ranking-works/

Don't forget to memcache these queries, and create a warm-up script.
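As a rough illustration of the caching advice, the sketch below keeps suggestions in an in-process dictionary, which stands in for memcached here, and pre-fills it from a list of common queries. The class `SuggestionCache` and the `lookup` callable (which would wrap the Sphinx query plus the distance check) are hypothetical names, not part of Sphinx or any memcached client.

```python
from typing import Callable, Dict, Iterable, Optional


class SuggestionCache:
    """Caches suggestion lookups; a plain dict stands in for memcached."""

    def __init__(self, lookup: Callable[[str], Optional[str]]) -> None:
        self._lookup = lookup  # hypothetical: runs the Sphinx query and distance check
        self._store: Dict[str, Optional[str]] = {}
        self.misses = 0  # number of times the underlying lookup was actually called

    def get(self, word: str) -> Optional[str]:
        key = word.strip().lower()  # normalize so 'Keywrod ' and 'keywrod' share an entry
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._lookup(key)
        return self._store[key]

    def warm_up(self, common_words: Iterable[str]) -> None:
        """Pre-fill the cache, e.g. from a log of frequent misspellings."""
        for word in common_words:
            self.get(word)
```

A warm-up script would then amount to constructing the cache and calling `warm_up` with the most frequent queries before traffic arrives.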
