Clustering Strings Based on Similar Word Sequences


Problem Description

I am looking for an efficient way to cluster about 10 million strings based on the appearance of similar word sequences.

Consider a list of strings like the following:

the fruit hut number one
the ice cre  am shop number one
jim's taco
ice cream shop in the corner
the ice cream shop
the fruit hut
jim's taco outlet number one
jim's t  aco in the corner
the fruit hut in the corner

After the algorithm runs on them, I want them clustered as follows:

the ice cre  am shop number one
ice cream shop in the corner
the ice cream shop

jim's taco
jim's taco outlet number one
jim's t  aco in the corner

the fruit hut
fruit hut number one
the fruit hut in the corner

As is obvious, the sequences that differentiate the clusters are:

ice cream shop, jim's taco and fruit hut

Answer

I think you are looking for near-duplicate detection, with some threshold of your choosing so that you cluster not only "near duplicates" but also documents that are similar enough.

One well-known solution is to use Jaccard similarity to measure how different two documents are.

Jaccard similarity is basically: take the set of words from each document, call these sets s1 and s2, and the Jaccard similarity is |s1 ∩ s2| / |s1 ∪ s2|.
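As a minimal illustration (my own Python sketch, not part of the original answer), the word-set version can be computed directly with built-in set operations:

def jaccard(a, b):
    """Jaccard similarity between two token sets: |a & b| / |a | b|."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # two empty documents count as identical
    return len(sa & sb) / len(sa | sb)

# 1-shingling: each string is reduced to its set of words
print(jaccard("the ice cream shop".split(),
              "ice cream shop in the corner".split()))  # 4 shared words / 6 total = 0.666...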

Usually when facing near duplicates, however, the order of words has some importance. To deal with it, when generating the sets s1 and s2 you actually generate sets of k-shingles (or k-grams) instead of sets of single words.
In your example, with k=2, the sets (built from your 2nd, 4th and 5th strings) will be:

s2 = { the ice, ice cre, cre am, am shop, shop number, number one }
s4 = { ice cream, cream shop, shop in, in the, the corner }
s5 = { the ice, ice cream, cream shop }

s4 ∪ s5 = { ice cream, cream shop, shop in, in the, the corner, the ice }
s4 ∩ s5 = { ice cream, cream shop }

In the above, the Jaccard similarity is 2/6.
In your case, ordinary k-shingling may perform worse than using single words (1-shingling), but you will have to test both approaches.
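A word-level shingling helper is easy to write; this small sketch (again my own illustration, not from the original answer) reproduces the 2/6 value for s4 and s5 above:

def shingles(text, k=2):
    """Set of word-level k-shingles (k-grams) of a string."""
    words = text.split()
    if len(words) < k:
        return {" ".join(words)}  # very short strings fall back to a single shingle
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

s4 = shingles("ice cream shop in the corner")
s5 = shingles("the ice cream shop")
print(len(s4 & s5) / len(s4 | s5))  # 2 shared shingles / 6 in the union = 0.333...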

This procedure can be scaled nicely to deal very efficiently with huge collections, without checking all pairs and without creating a huge number of sets. More details can be found in these lecture notes (I gave this lecture about 2 years ago, based on the author's notes).
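The standard trick for that kind of scaling (I assume this is what the notes cover) is to compress each shingle set into a MinHash signature, so similarity can be estimated, and candidate pairs found, without comparing every pair of full sets. A rough, dependency-free sketch of the signature idea:

import hashlib

def minhash_signature(shingle_set, num_hashes=64):
    """For each of num_hashes seeded hash functions, keep the minimum
    hash value over the shingles; the fraction of positions where two
    signatures agree approximates the Jaccard similarity of the sets."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity from two equal-length MinHash signatures."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)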

After you are done with this procedure, you basically have a measure d(s1, s2) of the distance between every two sentences, and you can use any known clustering algorithm to cluster them.
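As one deliberately simple choice of clustering (my own sketch, not something prescribed above), you could link every pair whose distance stays below a threshold and take the connected components; both the threshold and the shingle length are parameters you would have to tune on your data:

from itertools import combinations

def cluster_by_threshold(strings, shingle_fn, max_distance=0.8):
    """Union-find clustering: link two strings whenever the Jaccard
    distance d = 1 - similarity of their shingle sets is below
    max_distance, then return the resulting groups. (For ~10 million
    strings you would generate candidate pairs with MinHash/LSH rather
    than iterating over all pairs as done here.)"""
    parent = list(range(len(strings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    sets = [shingle_fn(s) for s in strings]
    for i, j in combinations(range(len(strings)), 2):
        sim = len(sets[i] & sets[j]) / len(sets[i] | sets[j])
        if 1.0 - sim < max_distance:
            parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i, s in enumerate(strings):
        clusters.setdefault(find(i), []).append(s)
    return list(clusters.values())

Called as cluster_by_threshold(my_strings, shingles), it reuses the shingles() helper from the earlier sketch; pass a word-set function instead for 1-shingling.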

Disclaimer: I used my answer from this thread as a basis for this one, after realizing that near-duplicate detection might fit here.

