在大量字符串中查找相似字符串组 [英] Finding groups of similar strings in a large set of strings
问题描述
我有一组相当大的字符串(比如 100 个),其中包含许多以相似性为特征的子组.我正在尝试寻找/设计一种算法,可以合理有效地找到这些组.
I have a reasonably large set of strings (say 100) which has a number of subgroups characterised by their similarity. I am trying to find/design an algorithm which would find theses groups reasonably efficiently.
举个例子,假设输入列表在下方左侧,输出组在右侧.
As an example let's say the input list is on the left below, and the output groups are on the right.
Input Output
----------------- -----------------
Jane Doe Mr Philip Roberts
Mr Philip Roberts Phil Roberts
Foo McBar Philip Roberts
David Jones
Phil Roberts Foo McBar
Davey Jones =>
John Smith David Jones
Philip Roberts Dave Jones
Dave Jones Davey Jones
Jonny Smith
Jane Doe
John Smith
Jonny Smith
有人知道有什么方法可以合理有效地解决这个问题吗?
Does anybody know of any ways to solve this reasonably efficiently?
寻找相似字符串的标准方法似乎是 Levenshtein 距离,但我不知道如何在这里充分利用它,而不必将每个字符串与列表中的每个其他字符串进行比较,然后以某种方式决定确定两个字符串是否在同一组中的差异阈值.
The standard method for finding similar strings seems to be the Levenshtein distance, but I can't see how I can make good use of that here without having to compare every string to every other string in the list, and then somehow decide on a difference threshold for deciding if the two strings are in the same group or not.
另一种方法是将字符串散列到整数的算法,其中类似的字符串散列到在数轴上靠在一起的整数.我不知道那会是什么算法,如果存在的话
An alternative would be an algorithm that hashes strings down to an integer, where similar strings hash to integers which are close together on the number-line. I have no idea what algorithm that would be though, if one even exists
有人有任何想法/指点吗?
Does anybody have any thoughts/pointers?
更新:@Will A:也许名字不像我最初想的那样是一个很好的例子.作为起点,我想我可以假设在我将使用的数据中,字符串中的微小变化不会使其从一组跳到另一组.
UPDATE: @Will A: Perhaps names weren't as good an example as I first thought. As a starting point I think I can assume that in the data I will be working with, a small change in a string will not make it jump from one group to another.
推荐答案
另一种流行的方法是通过字符串的 Jaccard 索引关联字符串.从 http://en.wikipedia.org/wiki/Jaccard_index 开始.
Another popular method is to associate the strings by their Jaccard index. Start with http://en.wikipedia.org/wiki/Jaccard_index.
这里有一篇关于使用 Jaccard-index(以及其他几种方法)解决像您这样的问题的文章:
Here's a article about using the Jaccard-index (and a couple of other methods) to solve a problem like yours:
http://matpalm.com/resemblance/
这篇关于在大量字符串中查找相似字符串组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!