在一大串字符串中查找类似字符串的组 [英] Finding groups of similar strings in a large set of strings

查看:182
本文介绍了在一大串字符串中查找类似字符串的组的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一组相当大的字符串(例如100),其中有许多以相似性为特征的子组。我试图找到/设计一个可以合理有效地找到这些组的算法。



举个例子,我们假设输入列表在左边,输出组在右边。

 输入输出
--------------- - -----------------
Jane Doe Philip Roberts先生
Philip Roberts先生Phil Roberts先生
Foo McBar Philip Roberts
David Jones
Phil Roberts Foo McBar
Davey Jones =>
约翰·史密斯大卫·琼斯
菲利普·罗伯茨·戴夫琼斯
达夫琼斯Davey琼斯
Jonny史密斯
简·多恩

约翰·史密斯
Jonny Smith

有人知道有什么方法可以有效地解决这个问题吗?



找到类似字符串的标准方法似乎是Levenshtein的距离,但是我无法看到我如何能很好地利用这一点,而无需比较每个字符串到列表中的每个其他字符串,然后以某种方式决定差异阈值来决定两个字符串是否在同一个组中。



另一种可能是算法将字符串哈希到一个整数,其中类似的字符串散列到数字行上靠近在一起的整数。我不知道什么算法会是,如果一个甚至存在



有没有任何想法/指针?





更新:
@Will A:或许名字不如我所想象的那么好。作为起点,我认为我可以假设在我将要使用的数据中,字符串的一个小的变化不会使它从一个组跳到另一个组。

解决方案

另一种流行的方法是将字符串与Jaccard索引相关联。从 http://en.wikipedia.org/wiki/Jaccard_index 开始。



这里有一篇关于使用Jaccard-index(和其他几种方法)来解决像你这样的问题的文章:



< a href =http://matpalm.com/resemblance/ =noreferrer> http://matpalm.com/resemblance/


I have a reasonably large set of strings (say 100) which has a number of subgroups characterised by their similarity. I am trying to find/design an algorithm which would find theses groups reasonably efficiently.

As an example let's say the input list is on the left below, and the output groups are on the right.

Input                           Output
-----------------               -----------------
Jane Doe                        Mr Philip Roberts
Mr Philip Roberts               Phil Roberts     
Foo McBar                       Philip Roberts   
David Jones                     
Phil Roberts                    Foo McBar        
Davey Jones            =>         
John Smith                      David Jones      
Philip Roberts                  Dave Jones       
Dave Jones                      Davey Jones      
Jonny Smith                     
                                Jane Doe         

                                John Smith       
                                Jonny Smith 

Does anybody know of any ways to solve this reasonably efficiently?

The standard method for finding similar strings seems to be the Levenshtein distance, but I can't see how I can make good use of that here without having to compare every string to every other string in the list, and then somehow decide on a difference threshold for deciding if the two strings are in the same group or not.

An alternative would be an algorithm that hashes strings down to an integer, where similar strings hash to integers which are close together on the number-line. I have no idea what algorithm that would be though, if one even exists

Does anybody have any thoughts/pointers?


UPDATE: @Will A: Perhaps names weren't as good an example as I first thought. As a starting point I think I can assume that in the data I will be working with, a small change in a string will not make it jump from one group to another.

解决方案

Another popular method is to associate the strings by their Jaccard index. Start with http://en.wikipedia.org/wiki/Jaccard_index.

Here's a article about using the Jaccard-index (and a couple of other methods) to solve a problem like yours:

http://matpalm.com/resemblance/

这篇关于在一大串字符串中查找类似字符串的组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆