模糊搜索算法(近似字符串匹配算法) [英] Fuzzy search algorithm (approximate string matching algorithm)

查看:1080
本文介绍了模糊搜索算法(近似字符串匹配算法)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我希望创建一个模糊搜索算法。
但是,经过数小时的研究,我真的很努力。



我想创建一种算法,对学校名称列表进行模糊搜索。 / p>

这是我到目前为止所看到的:



我的大部分研究继续指向Google和Stackoverflow上的 字符串指标,例如:




  • Levenshtein距离

  • Damerau-Levenshtein距离

  • Needleman–Wunsch算法



但是,这仅给出了两个字符串相似的得分。我可以想到的将其实现为搜索算法的唯一方法是执行线性搜索,并对每个字符串执行字符串度量算法,然后返回分数高于特定阈值的字符串。 (本来我的字符串存储在一个trie树中,但这显然对我没有帮助!)



尽管对于小名单来说,这并不是一个坏主意,对于说有100,000个名称的列表而言,这是有问题的,并且用户执行了许多查询。



我查看的另一种算法是 Spell-checker方法,您只需在其中搜索所有潜在的拼写错误。但是,这也是非常低效的,因为它的长度为7的单词需要75,000个单词,而错误计数仅为2。



我需要什么?



有人可以建议我一个高效的模糊搜索算法




  • 算法名称

  • 其工作原理或指向其工作原理的链接

  • 优点和缺点以及何时最佳使用(可选)



我知道所有算法都会有其优点和缺点,并且没有 best 算法。

解决方案

考虑到您正在尝试在学校名称列表上进行模糊搜索,我认为您不希望使用Levenshtein距离之类的传统字符串相似性。我的假设是,您正在接受用户的输入(键盘输入或通过电话说出),并且想要快速找到匹配的学校。



距离度量标准您将根据替换,删除和插入两个字符串的相似程度。但是这些算法并没有真正告诉您任何关于字符串在人类语言中作为单词的相似程度的信息。



例如,请考虑一下,单词史密斯,史密斯和烟熏。我可以通过两个步骤从 smythe转到 smith:

  smythe->史密斯-> smith 

从 smote到 smith分两个步骤:

  smote-> -> smith 

因此,两者的距离与字符串相同,但与 words ,它们有很大的不同。如果有人(说语言)告诉您他正在寻找 Symthe College,那么您几乎可以肯定地说,哦,我想您是说史密斯。但是,如果有人说 Smote College,那么您根本不知道他在说什么。



您需要的是语音算法,例如 Soundex Metaphone 。基本上,这些算法将一个单词分解为音素,并表示该单词在口语中的发音方式。然后,您可以将结果与已知单词列表进行比较以找到匹配项。



这样的系统比使用距离要快得多。指标。考虑到使用距离度量标准,您需要将用户输入的内容与列表中的每个单词进行比较,以获取距离。这在计算上很昂贵,而且正如我用史密斯和冒烟所演示的那样,结果可能很糟糕。



使用语音算法,您可以创建的音素表示将每个已知单词放入字典中(哈希图或Trie)。这是一次性的启动费用。然后,每当用户输入搜索词时,您就创建其输入的音素表示并在词典中查找它。



还要考虑一下,当人们拼写错误的专有名称时,他们几乎总是能正确输入第一个字母,而且经常会发音不正确。拼写错误的听起来像他们试图拼写的实际单词。如果真是这样,那么语音算法绝对是必经之路。


I wish to create a fuzzy search algorithm. However, upon hours of research I am really struggling.

I want to create an algorithm that performs a fuzzy search on a list of names of schools.

This is what I have looked at so far:

Most of my research keep pointing to "string metrics" on Google and Stackoverflow such as:

  • Levenshtein distance
  • Damerau-Levenshtein distance
  • Needleman–Wunsch algorithm

However this just gives a score of how similar 2 strings are. The only way I can think of implementing it as a search algorithm is to perform a linear search and executing the string metric algorithm for each string and returning the strings with scores above a certain threshold. (Originally I had my strings stored in a trie tree, but this obviously won't help me here!)

Although this is not such a bad idea for small lists, it would be problematic for lists with lets say a 100,000 names, and the user performed many queries.

Another algorithm I looked at is the Spell-checker method, where you just do a search for all potential misspellings. However this also is highly inefficient as it requires more than 75,000 words for a word of length 7 and error count of just 2.

What I need?

Can someone please suggest me a good efficient fuzzy search algorithm. with:

  • Name of the algorithm
  • How it works or a link to how it works
  • Pro's and cons and when it's best used (optional)

I understand that all algorithms will have their pros and cons and there is no best algorithm.

解决方案

Considering that you're trying to do a fuzzy search on a list of school names, I don't think you want to go for traditional string similarity like Levenshtein distance. My assumption is that you're taking a user's input (either keyboard input or spoken over the phone), and you want to quickly find the matching school.

Distance metrics tell you how similar two strings are based on substitutions, deletions, and insertions. But those algorithms don't really tell you anything about how similar the strings are as words in a human language.

Consider, for example, the words "smith," "smythe," and "smote". I can go from "smythe" to "smith" in two steps:

smythe -> smithe -> smith

And from "smote" to "smith" in two steps:

smote -> smite -> smith

So the two have the same distance as strings, but as words, they're significantly different. If somebody told you (spoken language) that he was looking for "Symthe College," you'd almost certainly say, "Oh, I think you mean Smith." But if somebody said "Smote College," you wouldn't have any idea what he was talking about.

What you need is a phonetic algorithm like Soundex or Metaphone. Basically, those algorithms break a word down into phonemes and create a representation of how the word is pronounced in spoken language. You can then compare the result against a known list of words to find a match.

Such a system would be much faster than using a distance metric. Consider that with a distance metric, you need to compare the user's input with every word in your list to obtain the distance. That is computationally expensive and the results, as I demonstrated with "smith" and "smote" can be laughably bad.

Using a phonetic algorithm, you create the phoneme representation of each of your known words and place it in a dictionary (a hash map or possibly a trie). That's a one-time startup cost. Then, whenever the user inputs a search term, you create the phoneme representation of his input and look it up in your dictionary. That is a lot faster and produces much better results.

Consider also that when people misspell proper names, they almost always get the first letter right, and more often than not pronouncing the misspelling sounds like the actual word they were trying to spell. If that's the case, then the phonetic algorithms are definitely the way to go.

这篇关于模糊搜索算法(近似字符串匹配算法)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆