What are some algorithms for comparing how similar two strings are?

Problem description

I need to compare strings in C++ to decide whether they represent the same thing. This relates to case titles entered by humans, where abbreviations and other small details may differ. For example, consider the following two titles:

std::string first = "Henry C. Harper v. The Law Offices of Huey & Luey, LLP";

As opposed to:

std::string second = "Harper v. The Law Offices of Huey & Luey, LLP";

A human can quickly gauge that these are most likely one and the same. The approach I have taken so far is to normalize the strings by lowercasing all letters and removing all punctuation and spaces, giving:

std::string firstNormalized = "henrycharpervthelawofficesofhueylueyllp";

std::string secondNormalized = "harpervthelawofficesofhueylueyllp";
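
The normalization described above might look roughly like the following sketch. The function name `normalizeTitle` is made up for illustration, and the code assumes plain ASCII input. It also keeps digits, which is an assumption the description above does not settle.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: lowercase every letter and drop every character
// that is not a letter or digit (punctuation, spaces, and so on).
// Assumes ASCII input; keeping digits is an assumption.
std::string normalizeTitle(const std::string& title) {
    std::string out;
    out.reserve(title.size());
    for (unsigned char ch : title) {
        if (std::isalnum(ch)) {
            out.push_back(static_cast<char>(std::tolower(ch)));
        }
    }
    return out;
}
```

Applied to the two titles above, this yields the two normalized strings shown.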

Comparing in this case, one is a sub-sequence of the other, but you can imagine other, more complex variations where that does not necessarily occur, yet they have significant sub-sequences in common. There could also be occasional human entry errors such as transposed letters and spelling errors.

Perhaps some kind of character diff program could help? I've seen good line diff programs for comparing differences in code to be checked in. Is there something like that on a character basis, maybe in Boost? If you could count the number of consecutive characters in common and take the ratio to the characters unshared, perhaps that would be a good heuristic?

In the end, I need a Boolean decision as to whether to consider them the same or not. It doesn't have to be perfect, but it should ideally rarely be wrong.

What algorithm can I use that will give me some kind of quantification of how similar the two strings are to each other, which I can then convert into a yes/no answer by way of some heuristic?

Recommended answer

What you're looking for are called string metric algorithms. There are a significant number of them, many with similar characteristics. Among the more popular:

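  • Levenshtein distance: the minimum number of single-character edits required to change one word into the other. The strings do not have to be the same length.
  • Hamming distance: the number of characters that differ between two strings of equal length.
  • Smith-Waterman: a family of algorithms for computing variable sub-sequence similarities.
  • Dice coefficient: a similarity algorithm that computes difference coefficients of adjacent character pairs.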
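
As an illustration of the first metric in the list, here is a minimal sketch of Levenshtein distance using the standard two-row dynamic-programming formulation, plus one possible way to turn it into a yes/no answer. The function names, the normalization by the longer string's length, and the 0.8 threshold are all assumptions chosen for illustration, not part of any library.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Levenshtein distance: minimum number of single-character insertions,
// deletions, and substitutions needed to turn `a` into `b`.
std::size_t levenshteinDistance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) {
        prev[j] = j;  // turning an empty prefix of `a` into b[0..j) takes j insertions
    }
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;  // turning a[0..i) into an empty string takes i deletions
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t substitution =
                prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            curr[j] = std::min({prev[j] + 1, curr[j - 1] + 1, substitution});
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

// Assumed scaling: 1.0 for identical strings, 0.0 when every position differs.
double levenshteinSimilarity(const std::string& a, const std::string& b) {
    const std::size_t longest = std::max(a.size(), b.size());
    if (longest == 0) {
        return 1.0;  // two empty strings are treated as identical
    }
    return 1.0 - static_cast<double>(levenshteinDistance(a, b)) /
                     static_cast<double>(longest);
}

// Hypothetical Boolean decision; the default threshold is an arbitrary
// starting point that would need tuning against real data.
bool probablySameTitle(const std::string& a, const std::string& b,
                       double threshold = 0.8) {
    return levenshteinSimilarity(a, b) >= threshold;
}
```

For the two normalized titles in the question, the distance is 6 (the six leading characters "henryc" are deleted), which gives a similarity of about 0.85 with this scaling, so the sketch would report them as the same under the assumed threshold.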

Have a look at these as well as the others on the wiki page on the topic.
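
Of the metrics listed above, the Dice coefficient is also short enough to sketch. The version below works on bigrams (adjacent character pairs) and counts repeated bigrams, which is one common variant and an assumption here. The function names and the handling of strings shorter than two characters are illustrative choices.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>

// Count every pair of adjacent characters (bigram) in the string.
std::map<std::string, std::size_t> countBigrams(const std::string& s) {
    std::map<std::string, std::size_t> counts;
    for (std::size_t i = 0; i + 1 < s.size(); ++i) {
        ++counts[s.substr(i, 2)];
    }
    return counts;
}

// Dice coefficient over bigrams: 2 * shared / (bigrams in a + bigrams in b).
// Returns a value in [0, 1]; 1.0 means the bigram multisets are identical.
double diceCoefficient(const std::string& a, const std::string& b) {
    if (a.size() < 2 || b.size() < 2) {
        // No bigrams to compare; fall back to exact equality (an assumption).
        return a == b ? 1.0 : 0.0;
    }
    const auto bigramsA = countBigrams(a);
    const auto bigramsB = countBigrams(b);
    std::size_t shared = 0;
    for (const auto& entry : bigramsA) {
        const auto it = bigramsB.find(entry.first);
        if (it != bigramsB.end()) {
            shared += std::min(entry.second, it->second);
        }
    }
    const std::size_t total = (a.size() - 1) + (b.size() - 1);
    return 2.0 * static_cast<double>(shared) / static_cast<double>(total);
}
```

On the two normalized titles from the question this gives roughly 0.91, since every bigram of the shorter string also occurs in the longer one. As with any of these metrics, the cut-off for a yes/no decision has to be chosen empirically.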
