如何找到类似的两个字符串 [英] Finding how similar two strings are

查看:119
本文介绍了如何找到类似的两个字符串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在寻找一种算法,需要2串并给我回一个相似系数。

I'm looking for an algorithm that takes 2 strings and will give me back a "factor of similarity".

基本上,我将有一个输入可能拼写错误,所信换位等优点,而且我一定要找到最接近的匹配(ES)的,我有可能值的列表。

Basically, I will have an input that may be misspelled, have letters transposed, etc, and I have to find the closest match(es) in a list of possible values that I have.

这是不用于搜索在数据库中。我有500个左右的字符串在内存中的列表来匹配,在全部30个字符,所以可以相对较慢。

This is not for searching in a database. I'll have an in-memory list of 500 or so strings to match against, all under 30 chars, so it can be relatively slow.

我知道这存在,我以前见过它,但我不记得它的名字。

I know this exists, i've seen it before, but I can't remember its name.


编辑:感谢您指出莱文斯坦和海明。 现在,我应该实现哪一个?他们基本上是衡量不同的东西,这两者都可以用来做什么我想,但我不知道哪一个是比较合适的。

Thanks for pointing out Levenshtein and Hamming. Now, which one should I implement? They basically measure different things, both of which can be used for what I want, but I'm not sure which one is more appropriate.

我的算法读了,海明似乎明显加快。由于没有将检测两个字符被调换(即约旦和Jodran),我相信这将是一个常见的​​错误,这将是更准确的我想要什么? 谁能告诉我一些关于取舍?

I've read up on the algorithms, Hamming seems obviously faster. Since neither will detect two characters being transposed (ie. Jordan and Jodran), which I believe will be a common mistake, which will be more accurate for what I want? Can someone tell me a bit about the trade-offs?

推荐答案

好了,标准的算法是:

1)海明距离的 对于相同长度的字串只好,但很有效。基本上,它只是计算了不同的字符数。没有用的自然语言文字模糊搜索。

1) Hamming distance Only good for strings of the same length, but very efficient. Basically it simply counts the number of distinct characters. Not useful for fuzzy searching of natural language text.

2)莱文施泰因距离。 的莱文施泰因距离测量距离中转化一根弦到另一所需的操作的数量而言。这些操作包括插入,删除和字符串替换。计算莱文施泰因距离的标准方法是使用动态编程

2) Levenstein distance. The Levenstein distance measures distance in terms of the number of "operations" required to transform one string to another. These operations include insertion, deletion and substition. The standard approach of calculating the Levenstein distance is to use dynamic programming.

3)广义莱文施泰因/(Damerau-Levenshtein距离) 这个距离也考虑到字符的考虑换位在一个字,并可能是编辑距离最适合于手动输入文本的模糊匹配。计算距离的算法是更复杂的比莱文施泰因距离的位(检测换位是不容易的)。最常见的实现是的 bitap 算法(如grep)的修改。

3) Generalized Levenstein/(Damerau–Levenshtein distance) This distance also takes into consideration transpositions of characters in a word, and is probably the edit distance most suited for fuzzy matching of manually-entered text. The algorithm to compute the distance is a bit more involved than the Levenstein distance (detecting transpositions is not easy). Most common implementations are a modification of the bitap algorithm (like grep).

在一般的你可能要考虑的近邻搜索的基础上kd树某种实施的第三个方案的实施

In general you would probably want to consider an implementation of the third option implemented in some sort of nearest neighbour search based on a k-d tree

这篇关于如何找到类似的两个字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆