Google fuzzy search (a.k.a. "suggestions"): What technique(s) are in use?


Problem description

I'm implementing search suggestion functionality in my web-app, and have been looking at existing implementations for techniques in use.

It seems as though most of the major sites (Amazon, Bing, etc.) implement fuzzy search in the following way:

Tokenize search string into terms
candidateStringsSet = allIndexedStrings
For each term
    if exact term is in index
        candidateStringsSet = candidateStringsSet.intersect(stringsIndexedBy(term))
    else
        Get possible terms (fuzzyTerms) from levenshtein(term, 1 (or 2))
        fuzzyMatchesSet = {}
        For each fuzzyTerm in fuzzyTerms
            if fuzzyTerm is in index
                fuzzyMatchesSet = fuzzyMatchesSet.union(stringsIndexedBy(fuzzyTerm))
        candidateStringsSet = candidateStringsSet.intersect(fuzzyMatchesSet)
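As a concrete reading of the pseudocode above, here is a minimal runnable sketch. The inverted index, the document ids, and the brute-force scan over every indexed term are illustrative assumptions on my part, not how any of these sites actually implements it:

```python
def levenshtein(a, b):
    # Wagner-Fischer dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Hypothetical inverted index: term -> set of ids of strings containing it
index = {
    "fifty": {1, 2}, "shades": {1}, "of": {1, 2, 3}, "grey": {1, 3},
}

def candidates(query, max_dist=2):
    result = None  # None means "no constraint yet", not the empty set
    for term in query.lower().split():
        if term in index:
            matches = set(index[term])
        else:
            # union the postings of every indexed term within edit distance
            matches = set()
            for known, ids in index.items():
                if levenshtein(term, known) <= max_dist:
                    matches |= ids
        result = matches if result is None else result & matches
    return result or set()
```

For instance, `candidates("fifty shades of gray")` returns `{1}`: each exact term narrows the set, and the misspelled "gray" is rescued by its distance-1 neighbour "grey", while `candidates("fiftyyyy shades of grey")` comes back empty because "fiftyyyy" is 3 edits from "fifty".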

The result set members are then presumably ranked by metrics (e.g. term order preservation, absolute term position, search popularity) and kept or discarded based on this ranking and a predetermined result-set size before being delivered back to the user.
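A toy version of such a ranking pass might look like the following; the weighting (ordered-term matches first, then earliest match position, then popularity) and the candidate fields are purely my assumptions for illustration:

```python
def rank(candidates, query_terms, top_k=10):
    # Rank candidate strings by (term order preservation, absolute term
    # position, popularity), then truncate to a fixed result-set size.
    def score(candidate):
        words = candidate["text"].lower().split()
        positions = [words.index(t) for t in query_terms if t in words]
        # term order preservation: adjacent query terms appearing in order
        in_order = sum(a < b for a, b in zip(positions, positions[1:]))
        # absolute term position: earlier matches score higher
        earliness = -min(positions, default=len(words))
        return (in_order, earliness, candidate["popularity"])
    return sorted(candidates, key=score, reverse=True)[:top_k]
```

With this scheme a candidate that preserves the query's term order outranks a more popular one that scrambles it, which matches the intuition that "fifty shades of grey" should beat "grey fifty" for the query terms ["fifty", "grey"].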

Google's implementation, on the other hand, differs quite a bit from this.

Specifically, it allows more than one error in the search string's constituent terms. The error threshold seems to depend on where the term of interest sits in the string, though it never exceeds 7.

What's interesting is that:

  1. Conducting a Levenshtein search with a threshold of 5 over the entire term space, for each term in the user's string, would be insanely expensive
  2. Even if #1 is what is done, it still wouldn't explain the absence of erroneous suggestions

N-grams also don't seem to be in use: modifying a term so that it contains no bigram present in the original term does not seem to affect the result(s).
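To make that bigram experiment concrete: a common n-gram matching score is the Jaccard similarity of the two terms' bigram sets, and a term can be padded so that it shares no bigram at all with the original (the helper names here are just for illustration):

```python
def bigrams(term):
    # all adjacent character pairs of the term
    return {term[i:i + 2] for i in range(len(term) - 1)}

def bigram_overlap(a, b):
    # Jaccard similarity of the two bigram sets
    ba, bb = bigrams(a), bigrams(b)
    return len(ba & bb) / len(ba | bb)
```

Here `bigram_overlap("grey", "gray")` is 0.2 (only "gr" is shared), while a padded variant such as "fxixfxtxy" shares no bigram with "fifty" at all (overlap 0.0), yet per the observation above Google still recovers the original term.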

Here's an example to illustrate my findings:

Example term: "Fiftyyyy shades of grey"

Amazon suggestions: none 
(if the error count exceeds 1 on any term, the search fails)

Bing suggestions: none
(if the error count exceeds 2 on any term, the search fails)

Google suggestions: 10 (max) 
(breaking the search would require 5 or more errors on any single term, 
or multiple errors on multiple terms)

My question is: what type of sorcery is at work here? Are they just using a Levenshtein search with a huge error allowance, or do they use another technique I am unaware of?

Answer

Maybe you should try this approach: http://norvig.com/spell-correct.html
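For reference, the core of Norvig's approach is to generate every string within edit distance 1 (and, by composing the step twice, distance 2) of the misspelled word and intersect that set with a dictionary of known words weighted by corpus frequency, rather than scanning the whole term space. A condensed sketch, where the toy WORDS table is a stand-in for the corpus counts he derives:

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    # every string one edit away: deletions, transpositions,
    # replacements, and insertions
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

# Toy word-frequency table; Norvig builds this from a large text corpus
WORDS = {"fifty": 100, "shades": 40, "of": 1000, "grey": 60}

def correct(word):
    # prefer the word itself, then distance-1, then distance-2 candidates,
    # and pick the most frequent known word among them
    candidates = ({word} & WORDS.keys()) \
        or (edits1(word) & WORDS.keys()) \
        or ({e2 for e1 in edits1(word) for e2 in edits1(e1)} & WORDS.keys()) \
        or {word}
    return max(candidates, key=lambda w: WORDS.get(w, 0))
```

Here `correct("gray")` returns "grey" and `correct("fiftyy")` returns "fifty". The key cost property is that the candidate set is generated from the input word, so the work grows with word length and alphabet size rather than with the size of the index, which speaks to the expense concern raised in the question.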
