Google fuzzy search (a.k.a "suggestions"): What technique(s) are in use?
Question
I'm implementing search suggestion functionality in my web-app, and have been looking at existing implementations for techniques in use.
It seems as though most of the major sites (Amazon, Bing, etc.) implement fuzzy search in the following way:
Tokenize search string into terms
processingSearchStringSet = {}
For each term
    if exact term is in index
        processingSearchStringSet.intersect(stringsIndexedByTermSet)
    else
        Get candidate terms (fuzzyTerms) from levenshtein(term, 1 (or 2))
        For each fuzzyTerm in fuzzyTerms
            if fuzzyTerm is in index
                processingSearchStringSet.intersect(stringsIndexedByFuzzyTermSet)
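The steps above can be sketched in Python. This is a toy illustration only (the inverted-index layout, the `suggest`/`levenshtein` names, and the linear scan over the vocabulary are my own assumptions, not any site's actual implementation):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def suggest(query, index, max_dist=1):
    """index: term -> set of document ids containing that term.
    Intersects postings across terms; unknown terms fall back to
    fuzzy matches within max_dist edits."""
    result = None
    for term in query.lower().split():
        if term in index:
            docs = index[term]
        else:
            # Union the postings of every indexed term within edit distance.
            docs = set()
            for cand, postings in index.items():
                if levenshtein(term, cand) <= max_dist:
                    docs |= postings
        result = docs if result is None else result & docs
    return result or set()
```

Note that a real engine would not scan the whole vocabulary per term; it would use something like a Levenshtein automaton or trie walk to find near matches cheaply.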
The result set members are then presumably ranked by metrics (ex: term order preservation, absolute term location, search popularity) and preserved or eliminated based on this ranking and a pre-determined result set size before being delivered back to the user.
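A minimal sketch of that ranking step, assuming a hypothetical popularity table and using term-order preservation as a tie-breaking signal (both metric choices are illustrative guesses, not a known implementation):

```python
def order_preserved(query_terms, suggestion):
    """1 if the query terms that occur in the suggestion appear
    in the same relative order as in the query, else 0."""
    words = suggestion.lower().split()
    positions = [words.index(t) for t in query_terms if t in words]
    return int(positions == sorted(positions))

def rank_and_trim(candidates, query_terms, popularity, k=10):
    """Rank candidate suggestions by (order preserved, popularity)
    and keep only the top k, as described above."""
    score = lambda s: (order_preserved(query_terms, s), popularity.get(s, 0))
    return sorted(candidates, key=score, reverse=True)[:k]
```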
Google's implementation, on the other hand, differs quite a bit from this.
Specifically, it allows more than 1 error in the search string's constituent terms. The error threshold seems to be dependent on where the term of interest is in the string, though it never exceeds 7.
What's interesting is that:
- Conducting a Levenshtein search with a threshold of 5 on the entire term space, for each term in the user's string, would be insanely expensive
- Even if #1 is what is done, it still wouldn't explain the absence of erroneous suggestions
N-grams also don't seem to be in use: modifying a term so that it doesn't contain a bigram present in the original term does not seem to affect the result(s).
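The probe described above can be reproduced with a quick check of character-bigram overlap; if a bigram index were in use, suggestion quality should degrade as shared bigrams disappear. A small sketch (the Dice-coefficient similarity here is one common n-gram measure, chosen for illustration):

```python
def bigrams(term):
    """Character bigrams of a term, e.g. 'grey' -> {'gr', 're', 'ey'}."""
    return {term[i:i + 2] for i in range(len(term) - 1)}

def bigram_similarity(a, b):
    """Dice coefficient over character bigrams: 2|A∩B| / (|A|+|B|).
    Returns 1.0 for two terms too short to have bigrams at all."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba and not bb:
        return 1.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))
```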
Here's an example to illustrate my findings:
Example term: "Fiftyyyy shades of grey"
Amazon suggestions: none
(if the error count exceeds 1 on any term, the search fails)
Bing suggestions: none
(if the error count exceeds 2 on any term, the search fails)
Google suggestions: 10 (max)
(breaking the search would require 5 or more errors on any single term,
or multiple errors on multiple terms)
My question is: what type of sorcery is at work here? Are they just using a Levenshtein search with a huge error allowance, or do they use another technique I am unaware of?
Answer
Maybe you should try this approach: http://norvig.com/spell-correct.html
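For reference, the linked article's core idea is to generate every string within one or two edits of the input and pick the most frequent known word. A condensed sketch of that approach, with a toy word-frequency table standing in for the article's real corpus:

```python
from collections import Counter
import string

# Toy corpus; the article builds this Counter from a large text file.
WORDS = Counter("fifty shades of grey fifty shades fifty".split())

def edits1(word):
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """Subset of words that appear in the corpus."""
    return {w for w in words if w in WORDS}

def correction(word):
    """Most probable spelling correction: prefer the word itself,
    then known words 1 edit away, then 2 edits away."""
    candidates = (known([word]) or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or [word])
    return max(candidates, key=WORDS.get)
```

Note this still only covers 1-2 edits per term; matching Google's apparent tolerance of up to 5+ errors would require something beyond this basic scheme.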