Finding groups of similar strings in a large set of strings

Problem description

I have a reasonably large set of strings (say 100) that contains a number of subgroups characterised by their similarity. I am trying to find or design an algorithm that would find these groups reasonably efficiently.

As an example, let's say the input list is on the left below and the output groups are on the right.

Input                           Output
-----------------               -----------------
Jane Doe                        Mr Philip Roberts
Mr Philip Roberts               Phil Roberts     
Foo McBar                       Philip Roberts   
David Jones                     
Phil Roberts                    Foo McBar        
Davey Jones            =>         
John Smith                      David Jones      
Philip Roberts                  Dave Jones       
Dave Jones                      Davey Jones      
Jonny Smith                     
                                Jane Doe         

                                John Smith       
                                Jonny Smith 

Does anybody know of any ways to solve this reasonably efficiently?

The standard method for finding similar strings seems to be the Levenshtein distance, but I can't see how to make good use of it here without comparing every string to every other string in the list, and then somehow choosing a difference threshold to decide whether two strings belong in the same group.
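
For reference, here is a minimal sketch of the brute-force approach described above: compute the Levenshtein distance for every pair, link any pair within a threshold, and read the groups off the linked components. The function names and the `max_distance=3` threshold are illustrative assumptions, not something taken from an existing library, and the threshold would need tuning for real data.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def group_by_edit_distance(strings, max_distance=3):
    """Link every pair within max_distance; return the connected groups."""
    parent = list(range(len(strings)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(strings)), 2):
        if levenshtein(strings[i], strings[j]) <= max_distance:
            parent[find(i)] = find(j)

    groups = {}
    for i, s in enumerate(strings):
        groups.setdefault(find(i), []).append(s)
    return list(groups.values())


names = ["Jane Doe", "Mr Philip Roberts", "Foo McBar", "David Jones",
         "Phil Roberts", "Davey Jones", "John Smith", "Philip Roberts",
         "Dave Jones", "Jonny Smith"]
groups = group_by_edit_distance(names, max_distance=3)
```

For 100 strings this is only 4,950 comparisons. Note that the linking is transitive: "Mr Philip Roberts" and "Phil Roberts" are 5 edits apart, but both are within 3 edits of "Philip Roberts", so all three land in one group. That chaining can also merge groups that should stay separate if the threshold is too loose.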

An alternative would be an algorithm that hashes strings down to an integer, where similar strings hash to integers that are close together on the number line. I have no idea what algorithm that would be, though, if one even exists.

Does anybody have any thoughts/pointers?

UPDATE: @Will A: Perhaps names weren't as good an example as I first thought. As a starting point, I think I can assume that in the data I will be working with, a small change in a string will not make it jump from one group to another.

Recommended answer

Another popular method is to associate the strings by their Jaccard index. Start with http://en.wikipedia.org/wiki/Jaccard_index.

Here's an article about using the Jaccard index (and a couple of other methods) to solve a problem like yours:

http://matpalm.com/resemblance/
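
As a rough illustration of the Jaccard idea (not code from the linked article), each string can be turned into a set of character bigrams and two strings compared by the size of the intersection of their sets divided by the size of the union. The sketch below assumes lowercase bigrams, a threshold of 0.4 that happens to suit the sample names, and a simple greedy grouping; all names and parameters are hypothetical choices for the example.

```python
def shingles(text: str, k: int = 2) -> set:
    """Set of lowercase character k-grams ("shingles") in text."""
    text = text.lower()
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard index: size of intersection over size of union."""
    if not a and not b:
        return 1.0  # treat two empty sets as identical
    return len(a & b) / len(a | b)


def group_by_jaccard(strings, threshold=0.4, k=2):
    """Greedy grouping: join the first group holding a similar member."""
    sets = [shingles(s, k) for s in strings]
    groups = []  # each group is a list of indices into strings
    for i in range(len(strings)):
        for group in groups:
            if any(jaccard(sets[i], sets[j]) >= threshold for j in group):
                group.append(i)
                break
        else:
            groups.append([i])  # nothing similar enough: start a new group
    return [[strings[i] for i in group] for group in groups]


names = ["Jane Doe", "Mr Philip Roberts", "Foo McBar", "David Jones",
         "Phil Roberts", "Davey Jones", "John Smith", "Philip Roberts",
         "Dave Jones", "Jonny Smith"]
groups = group_by_jaccard(names)
```

On the sample list from the question this yields the five expected groups. The weakest within-group pair is "John Smith" and "Jonny Smith" at 6/13 (about 0.46), which is why the threshold here sits below 0.5. This version still compares each string against earlier ones, so it is a quadratic-time sketch that is fine for around 100 strings rather than a scalable solution.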
