Finding groups of similar strings in a large set of strings
Question
I have a reasonably large set of strings (say 100) which contains a number of subgroups characterised by their similarity. I am trying to find or design an algorithm which would find these groups reasonably efficiently.
As an example let's say the input list is on the left below, and the output groups are on the right.
Input                        Output
-----------------            -----------------
Jane Doe                     Mr Philip Roberts
Mr Philip Roberts            Phil Roberts
Foo McBar                    Philip Roberts
David Jones
Phil Roberts                 Foo McBar
Davey Jones          =>
John Smith                   David Jones
Philip Roberts               Dave Jones
Dave Jones                   Davey Jones
Jonny Smith
                             Jane Doe

                             John Smith
                             Jonny Smith
Does anybody know of any ways to solve this reasonably efficiently?
The standard method for finding similar strings seems to be the Levenshtein distance, but I can't see how I can make good use of that here without having to compare every string to every other string in the list, and then somehow decide on a difference threshold for deciding if the two strings are in the same group or not.
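For a list of around 100 strings, comparing every pair is only 4,950 distance computations, so the brute-force approach described above may be cheap enough in practice. Here is a minimal Python sketch under assumptions that are not from the question: the names `levenshtein` and `group_by_distance` are hypothetical, the distance is normalised by the length of the longer string, the 0.3 cut-off is an arbitrary starting value that would need tuning, and groups are formed by single-link merging with a union-find structure, which is only one of several possible choices.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a  # keep the row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def group_by_distance(strings, max_ratio=0.3):
    """Single-link grouping: two strings share a group when their
    normalised edit distance is <= max_ratio (an assumed threshold)."""
    parent = list(range(len(strings)))

    def find(i):
        # Follow parent links to the group representative (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(strings)), 2):
        a, b = strings[i], strings[j]
        ratio = levenshtein(a, b) / max(len(a), len(b), 1)
        if ratio <= max_ratio:
            parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i, s in enumerate(strings):
        groups.setdefault(find(i), []).append(s)
    return list(groups.values())


names = ["Jane Doe", "Mr Philip Roberts", "Foo McBar", "David Jones",
         "Phil Roberts", "Davey Jones", "John Smith", "Philip Roberts",
         "Dave Jones", "Jonny Smith"]
for group in group_by_distance(names):
    print(group)
```

Single-link merging means a chain of small changes can pull distant strings into one group, which fits the assumption that a small change never moves a string between groups but may be too permissive for other data.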
An alternative would be an algorithm that hashes strings down to an integer, where similar strings hash to integers which are close together on the number line. I have no idea what algorithm that would be though, if one even exists.
Does anybody have any thoughts/pointers?
UPDATE: @Will A: Perhaps names weren't as good an example as I first thought. As a starting point I think I can assume that in the data I will be working with, a small change in a string will not make it jump from one group to another.
Recommended answer
Another popular method is to associate the strings by their Jaccard index. Start with http://en.wikipedia.org/wiki/Jaccard_index.
Here's an article about using the Jaccard index (and a couple of other methods) to solve a problem like yours:
http://matpalm.com/resemblance/
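To make the Jaccard index concrete: it is the size of the intersection of two sets divided by the size of their union. For strings, one common way to get sets is to break each string into overlapping character n-grams ("shingles"). The Python sketch below is illustrative only and is not taken from the linked article. The function names, the choice of lower-cased character bigrams, the 0.4 threshold, and the simple greedy grouping are all assumptions.

```python
def shingles(text: str, k: int = 2) -> set:
    """Set of lower-cased character k-grams; very short strings yield themselves."""
    text = text.lower()
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a: str, b: str, k: int = 2) -> float:
    """Jaccard index |A & B| / |A | B| of the two shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


def greedy_groups(strings, min_similarity=0.4):
    """Put each string into the first group that has a member at least
    min_similarity similar to it, otherwise start a new group.
    (Greedy and order-dependent; the threshold is an assumed value.)"""
    groups = []
    for s in strings:
        for group in groups:
            if any(jaccard(s, member) >= min_similarity for member in group):
                group.append(s)
                break
        else:
            groups.append([s])
    return groups


names = ["Jane Doe", "Mr Philip Roberts", "Foo McBar", "David Jones",
         "Phil Roberts", "Davey Jones", "John Smith", "Philip Roberts",
         "Dave Jones", "Jonny Smith"]
for group in greedy_groups(names):
    print(group)
```

Because the similarity is computed on sets rather than sequences, a prefix such as "Mr " costs only a few extra shingles, which is one reason this measure tends to be forgiving of inserted or reordered words.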