What are some algorithms for comparing how similar two strings are?

Views: 819 | Posted: 2015/11/30 16:02:18 | Tags: algorithm, string-comparison, stdstring, heuristics

Problem description

I need to compare strings in C++ to decide whether they represent the same thing. This relates to case titles entered by humans where abbreviations and other small details may differ. For example, consider the following two titles:

std::string first = "Henry C. Harper v. The Law Offices of Huey & Luey, LLP";

as opposed to:

std::string second = "Harper v. The Law Offices of Huey & Luey, LLP";

A human can quickly gauge that these are most likely one and the same. The approach I have taken so far is to normalize the strings by lowercasing all letters and removing all punctuation and spaces, giving:

std::string firstNormalized = "henrycharpervthelawofficesofhueylueyllp";

and

std::string secondNormalized = "harpervthelawofficesofhueylueyllp";
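
A minimal sketch of the normalization step described above, assuming plain ASCII titles. The function name `normalizeTitle` is hypothetical, and keeping digits is an assumption, since the description only mentions letters, punctuation and spaces.

```cpp
#include <cctype>
#include <string>

// Lowercase every letter and drop everything that is not a letter or digit
// (punctuation, spaces). Assumption: ASCII input in the default "C" locale,
// where bytes outside the ASCII range are not alphanumeric and get dropped.
std::string normalizeTitle(const std::string& title) {
    std::string out;
    out.reserve(title.size());
    for (unsigned char c : title) {  // unsigned char keeps <cctype> calls well-defined
        if (std::isalnum(c)) {
            out.push_back(static_cast<char>(std::tolower(c)));
        }
    }
    return out;
}
```

Calling `normalizeTitle(first)` and `normalizeTitle(second)` on the two titles above should produce the two normalized strings shown.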

Comparing in this case, one is a sub-sequence of the other, but you can imagine other more complex variations where that does not necessarily occur, yet they have significant sub-sequences in common. There could also be occasional human entry errors such as transposed letters and spelling errors.
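
The sub-sequence observation can be checked directly. This is only a sketch with a hypothetical function name, `isSubsequence`. It covers the simple case only, and will return false for the transposed-letter and misspelling cases mentioned above.

```cpp
#include <cstddef>
#include <string>

// Returns true if every character of `shorter` appears in `longer`
// in the same order, though not necessarily contiguously.
bool isSubsequence(const std::string& shorter, const std::string& longer) {
    std::size_t matched = 0;  // number of characters of `shorter` matched so far
    for (char c : longer) {
        if (matched < shorter.size() && shorter[matched] == c) {
            ++matched;
        }
    }
    return matched == shorter.size();
}
```

With the normalized strings above, `isSubsequence(secondNormalized, firstNormalized)` is true, while swapping the arguments gives false because the first string is longer.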

Perhaps some kind of character diff program could help? I've seen good line diff programs for comparing differences in code to be checked in; is there something like that on a character basis, maybe in Boost? If you could count the number of consecutive characters in common and take the ratio to the characters unshared, perhaps that would be a good heuristic?
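
One possible reading of the "consecutive characters in common" idea is the longest common substring. The sketch below is an interpretation, not an established recipe: the names `longestCommonRun` and `commonRunRatio` are hypothetical, and dividing by the longer string's length (rather than by the unshared characters) is an assumption chosen to keep the score between 0 and 1.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Length of the longest run of consecutive characters shared by a and b
// (longest common substring), using O(|a| * |b|) dynamic programming
// with two rolling rows.
std::size_t longestCommonRun(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1, 0), curr(b.size() + 1, 0);
    std::size_t best = 0;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        for (std::size_t j = 1; j <= b.size(); ++j) {
            // Extend the run ending at (i-1, j-1) on a match, otherwise reset.
            curr[j] = (a[i - 1] == b[j - 1]) ? prev[j - 1] + 1 : 0;
            best = std::max(best, curr[j]);
        }
        std::swap(prev, curr);
    }
    return best;
}

// Assumed scoring: shared run length divided by the longer string's length.
double commonRunRatio(const std::string& a, const std::string& b) {
    const std::size_t longer = std::max(a.size(), b.size());
    if (longer == 0) return 1.0;  // two empty strings are treated as identical
    return static_cast<double>(longestCommonRun(a, b)) / static_cast<double>(longer);
}
```

A single long shared run is sensitive to one typo in the middle of the string, which splits the run in two, so this score is probably best combined with another measure.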

In the end, I need a Boolean decision as to whether to consider them the same or not. It doesn't have to be perfect, but it should ideally rarely be wrong.

What algorithm can I use that will give me some kind of quantification as to how similar the two strings are to each other which I can then convert into a yes/no answer by way of some heuristic?
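
One widely used quantification of this kind is the Levenshtein edit distance: the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other. The sketch below normalizes it into a score and applies a cutoff. The names `similarity` and `probablySame` and the 0.8 threshold are hypothetical, and the threshold would need tuning on real titles. Plain Levenshtein counts a transposed pair of letters as two edits.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Classic Levenshtein distance using a single rolling row of the DP table.
std::size_t levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::size_t> row(b.size() + 1);
    std::iota(row.begin(), row.end(), std::size_t{0});  // distance from "" to b[0..j)
    for (std::size_t i = 1; i <= a.size(); ++i) {
        std::size_t diag = row[0];  // holds dp[i-1][j-1]
        row[0] = i;                 // distance from a[0..i) to ""
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t up = row[j];  // dp[i-1][j], before it is overwritten
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            row[j] = std::min({row[j - 1] + 1,  // insertion
                               up + 1,          // deletion
                               diag + cost});   // match or substitution
            diag = up;
        }
    }
    return row[b.size()];
}

// Assumed scoring: 1.0 means identical, 0.0 means nothing in common.
double similarity(const std::string& a, const std::string& b) {
    const std::size_t longer = std::max(a.size(), b.size());
    if (longer == 0) return 1.0;
    return 1.0 - static_cast<double>(levenshtein(a, b)) / static_cast<double>(longer);
}

// Yes/no decision; the default threshold is an arbitrary starting point.
bool probablySame(const std::string& a, const std::string& b, double threshold = 0.8) {
    return similarity(a, b) >= threshold;
}
```

For the two normalized titles above the distance is 6 (the six leading characters "henryc" are deleted), giving a similarity of about 0.85, so `probablySame` returns true at the assumed threshold.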
