Google fuzzy search (a.k.a "suggestions"): What technique(s) are in use?
Question
I'm implementing search suggestion functionality in my web-app, and have been looking at existing implementations for techniques in use.
It seems as though most of the major sites (Amazon, Bing, etc.) implement fuzzy search in the following way:
Tokenize search string into terms
processingSearchStringSet = {}
For each term
    if exact term is in index
        processingSearchStringSet.intersect(stringsIndexedByTermSet)
    else
        Get candidate terms (fuzzyTerms) from levenshtein(term, 1 (or 2))
        For each fuzzyTerm in fuzzyTerms
            if fuzzyTerm is in index
                processingSearchStringSet.intersect(stringsIndexedByFuzzyTermSet)
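The steps above can be sketched in Python. This is a toy illustration only (the inverted-index layout, the `suggest`/`levenshtein` names, and the linear scan over the vocabulary are my own assumptions, not any site's actual implementation):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def suggest(query, index, max_dist=1):
    """index: term -> set of document ids containing that term.
    Intersects postings across terms; unknown terms fall back to
    fuzzy matches within max_dist edits."""
    result = None
    for term in query.lower().split():
        if term in index:
            docs = index[term]
        else:
            # Union the postings of every indexed term within edit distance.
            docs = set()
            for cand, postings in index.items():
                if levenshtein(term, cand) <= max_dist:
                    docs |= postings
        result = docs if result is None else result & docs
    return result or set()
```

Note that a real engine would not scan the whole vocabulary per term; it would use something like a Levenshtein automaton or trie walk to find near matches cheaply.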
The result set members are then presumably ranked by metrics (ex: term order preservation, absolute term location, search popularity) and preserved or eliminated based on this ranking and a pre-determined result set size before being delivered back to the user.
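A minimal sketch of that ranking step, assuming a hypothetical popularity table and using term-order preservation as a tie-breaking signal (both metric choices are illustrative guesses, not a known implementation):

```python
def order_preserved(query_terms, suggestion):
    """1 if the query terms that occur in the suggestion appear
    in the same relative order as in the query, else 0."""
    words = suggestion.lower().split()
    positions = [words.index(t) for t in query_terms if t in words]
    return int(positions == sorted(positions))

def rank_and_trim(candidates, query_terms, popularity, k=10):
    """Rank candidate suggestions by (order preserved, popularity)
    and keep only the top k, as described above."""
    score = lambda s: (order_preserved(query_terms, s), popularity.get(s, 0))
    return sorted(candidates, key=score, reverse=True)[:k]
```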
Google's implementation, on the other hand, differs quite a bit from this.
Specifically, it allows more than 1 error in the search string's constituent terms. The error threshold seems to be dependent on where the term of interest is in the string, though it never exceeds 7.
What's interesting is that:
- Conducting a Levenshtein search with a threshold of 5 on the entire term space, for each term in the user's string, would be insanely expensive
- Even if #1 is what is done, it still wouldn't explain the absence of erroneous suggestions
N-grams also don't seem to be in use: modifying a term so that it doesn't contain a bigram present in the original term does not seem to affect the result(s).
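The probe described above can be reproduced with a quick check of character-bigram overlap; if a bigram index were in use, suggestion quality should degrade as shared bigrams disappear. A small sketch (the Dice-coefficient similarity here is one common n-gram measure, chosen for illustration):

```python
def bigrams(term):
    """Character bigrams of a term, e.g. 'grey' -> {'gr', 're', 'ey'}."""
    return {term[i:i + 2] for i in range(len(term) - 1)}

def bigram_similarity(a, b):
    """Dice coefficient over character bigrams: 2|A∩B| / (|A|+|B|).
    Returns 1.0 for two terms too short to have bigrams at all."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba and not bb:
        return 1.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))
```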
Here's an example to illustrate my findings:
Example term: "Fiftyyyy shades of grey"
Amazon suggestions: none
(if the error count exceeds 1 on any term, the search fails)
Bing suggestions: none
(if the error count exceeds 2 on any term, the search fails)
Google suggestions: 10 (max)
(breaking the search would require 5 or more errors on any single term,
or multiple errors on multiple terms)
My question is: what type of sorcery is at work here? Are they just using a Levenshtein search with a huge error allowance, or do they use another technique I am unaware of?
Answer
Maybe you should try this approach: http://norvig.com/spell-correct.html
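For reference, the linked article's core idea is to generate every string within one or two edits of the input and pick the most frequent known word. A condensed sketch of that approach, with a toy word-frequency table standing in for the article's real corpus:

```python
from collections import Counter
import string

# Toy corpus; the article builds this Counter from a large text file.
WORDS = Counter("fifty shades of grey fifty shades fifty".split())

def edits1(word):
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """Subset of words that appear in the corpus."""
    return {w for w in words if w in WORDS}

def correction(word):
    """Most probable spelling correction: prefer the word itself,
    then known words 1 edit away, then 2 edits away."""
    candidates = (known([word]) or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or [word])
    return max(candidates, key=WORDS.get)
```

Note this still only covers 1-2 edits per term; matching Google's apparent tolerance of up to 5+ errors would require something beyond this basic scheme.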