在大量字符串中查找相似字符串组 [英] Finding groups of similar strings in a large set of strings

查看:62
本文介绍了在大量字符串中查找相似字符串组的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一组相当大的字符串(比如 100 个),其中包含许多以相似性为特征的子组.我正在尝试寻找/设计一种算法,可以合理有效地找到这些组.

I have a reasonably large set of strings (say 100) which has a number of subgroups characterised by their similarity. I am trying to find/design an algorithm which would find theses groups reasonably efficiently.

举个例子,假设输入列表在下方左侧,输出组在右侧.

As an example let's say the input list is on the left below, and the output groups are on the right.

Input                           Output
-----------------               -----------------
Jane Doe                        Mr Philip Roberts
Mr Philip Roberts               Phil Roberts     
Foo McBar                       Philip Roberts   
David Jones                     
Phil Roberts                    Foo McBar        
Davey Jones            =>         
John Smith                      David Jones      
Philip Roberts                  Dave Jones       
Dave Jones                      Davey Jones      
Jonny Smith                     
                                Jane Doe         

                                John Smith       
                                Jonny Smith 

有人知道有什么方法可以合理有效地解决这个问题吗?

Does anybody know of any ways to solve this reasonably efficiently?

寻找相似字符串的标准方法似乎是 Levenshtein 距离,但我不知道如何在这里充分利用它,而不必将每个字符串与列表中的每个其他字符串进行比较,然后以某种方式决定确定两个字符串是否在同一组中的差异阈值.

The standard method for finding similar strings seems to be the Levenshtein distance, but I can't see how I can make good use of that here without having to compare every string to every other string in the list, and then somehow decide on a difference threshold for deciding if the two strings are in the same group or not.

另一种方法是将字符串散列到整数的算法,其中类似的字符串散列到在数轴上靠在一起的整数.我不知道那会是什么算法,如果存在的话

An alternative would be an algorithm that hashes strings down to an integer, where similar strings hash to integers which are close together on the number-line. I have no idea what algorithm that would be though, if one even exists

有人有任何想法/指点吗?

Does anybody have any thoughts/pointers?

更新:@Will A:也许名字不像我最初想的那样是一个很好的例子.作为起点,我想我可以假设在我将使用的数据中,字符串中的微小变化不会使其从一组跳到另一组.

UPDATE: @Will A: Perhaps names weren't as good an example as I first thought. As a starting point I think I can assume that in the data I will be working with, a small change in a string will not make it jump from one group to another.

推荐答案

另一种流行的方法是通过字符串的 Jaccard 索引关联字符串.从 http://en.wikipedia.org/wiki/Jaccard_index 开始.

Another popular method is to associate the strings by their Jaccard index. Start with http://en.wikipedia.org/wiki/Jaccard_index.

这里有一篇关于使用 Jaccard-index(以及其他几种方法)解决像您这样的问题的文章:

Here's a article about using the Jaccard-index (and a couple of other methods) to solve a problem like yours:

http://matpalm.com/resemblance/

这篇关于在大量字符串中查找相似字符串组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆