在一大串字符串中查找类似字符串的组 [英] Finding groups of similar strings in a large set of strings
问题描述
举个例子,我们假设输入列表在左边,输出组在右边。
输入输出
--------------- - -----------------
Jane Doe Philip Roberts先生
Philip Roberts先生Phil Roberts先生
Foo McBar Philip Roberts
David Jones
Phil Roberts Foo McBar
Davey Jones =>
约翰·史密斯大卫·琼斯
菲利普·罗伯茨·戴夫琼斯
达夫琼斯Davey琼斯
Jonny史密斯
简·多恩
约翰·史密斯
Jonny Smith
有人知道有什么方法可以有效地解决这个问题吗?
找到类似字符串的标准方法似乎是Levenshtein的距离,但是我无法看到我如何能很好地利用这一点,而无需比较每个字符串到列表中的每个其他字符串,然后以某种方式决定差异阈值来决定两个字符串是否在同一个组中。
另一种可能是算法将字符串哈希到一个整数,其中类似的字符串散列到数字行上靠近在一起的整数。我不知道什么算法会是,如果一个甚至存在
有没有任何想法/指针?
更新:
@Will A:或许名字不如我所想象的那么好。作为起点,我认为我可以假设在我将要使用的数据中,字符串的一个小的变化不会使它从一个组跳到另一个组。
另一种流行的方法是将字符串与Jaccard索引相关联。从 http://en.wikipedia.org/wiki/Jaccard_index 开始。
这里有一篇关于使用Jaccard-index(和其他几种方法)来解决像你这样的问题的文章:
< a href =http://matpalm.com/resemblance/ =noreferrer> http://matpalm.com/resemblance/
I have a reasonably large set of strings (say 100) which has a number of subgroups characterised by their similarity. I am trying to find/design an algorithm which would find theses groups reasonably efficiently.
As an example let's say the input list is on the left below, and the output groups are on the right.
Input Output
----------------- -----------------
Jane Doe Mr Philip Roberts
Mr Philip Roberts Phil Roberts
Foo McBar Philip Roberts
David Jones
Phil Roberts Foo McBar
Davey Jones =>
John Smith David Jones
Philip Roberts Dave Jones
Dave Jones Davey Jones
Jonny Smith
Jane Doe
John Smith
Jonny Smith
Does anybody know of any ways to solve this reasonably efficiently?
The standard method for finding similar strings seems to be the Levenshtein distance, but I can't see how I can make good use of that here without having to compare every string to every other string in the list, and then somehow decide on a difference threshold for deciding if the two strings are in the same group or not.
An alternative would be an algorithm that hashes strings down to an integer, where similar strings hash to integers which are close together on the number-line. I have no idea what algorithm that would be though, if one even exists
Does anybody have any thoughts/pointers?
UPDATE: @Will A: Perhaps names weren't as good an example as I first thought. As a starting point I think I can assume that in the data I will be working with, a small change in a string will not make it jump from one group to another.
Another popular method is to associate the strings by their Jaccard index. Start with http://en.wikipedia.org/wiki/Jaccard_index.
Here's a article about using the Jaccard-index (and a couple of other methods) to solve a problem like yours:
http://matpalm.com/resemblance/
这篇关于在一大串字符串中查找类似字符串的组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!