How does clustering (especially string clustering) work?


Question

I have heard that clustering is used to group similar data. I want to know how it works in the specific case of strings.

I have a table with more than 100,000 distinct words.

I want to identify the same word despite small differences (e.g. house, house!!, hooouse, HoUse, @house, "house", etc.).

What is needed to measure the similarity and group each word into a cluster? Which algorithm is recommended for this?

Recommended answer

To understand what clustering is, imagine a geographical map. You can see many distinct objects (such as houses). Some of them are close to each other, and others are far apart. Based on this, you can split all the objects into groups (such as cities). Clustering algorithms do exactly this: they let you split your data into groups without specifying the group boundaries in advance.

All clustering algorithms rely on a measure of distance (or similarity) between two objects. On a geographical map this is the ordinary distance between two houses; in a multidimensional space it may be the Euclidean distance (in fact, the distance between two houses on a map is also Euclidean). For string comparison you need something different. Two good choices here are Hamming distance and Levenshtein distance. In your particular case Levenshtein distance is preferable, because Hamming distance is defined only for strings of the same length.
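As a rough illustration of the two string distances mentioned above, here is a minimal pure-Python sketch. The function names `hamming` and `levenshtein` are chosen for this example only and are not taken from any particular library.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b (dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]
```

With these definitions, `levenshtein("house", "@house")` is 1 and `levenshtein("house", "hooouse")` is 2, while `hamming("house", "hooouse")` raises an error because the lengths differ.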

Now you can use one of the existing clustering algorithms. There are plenty of them, but not all will fit your needs. For example, plain k-means, already mentioned here, will hardly help you, because it requires the number of groups up front, and with a large dictionary of strings that could be 100, 200, 500 or 10,000: you simply don't know the number. So other algorithms may be more appropriate.

One of them is the expectation-maximization (EM) algorithm. Its advantage is that some implementations can determine the number of clusters automatically. In practice, however, it often gives less precise results than other algorithms, so it is common to run k-means on top of EM: first find the number of clusters and their centers with EM, then use k-means to refine the result.

Another branch of algorithms that may suit your task is hierarchical clustering. In this case the result of the cluster analysis is not a set of independent groups but a tree (hierarchy), in which several smaller clusters are grouped into a bigger one, and all clusters are ultimately part of one big cluster. In your case this means that all words are similar to each other to some degree.
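The hierarchical idea can be sketched in a few lines of Python. The example below is only an illustration under several assumptions: it uses single linkage (the distance between two clusters is the smallest Levenshtein distance between their members), it stops merging at a hypothetical `max_distance` threshold instead of building the full tree, and the names `single_linkage` and `max_distance` are invented for this sketch. The naive loop is far too slow for 100,000 words and is meant for small word lists only.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (one row at a time)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(current[j - 1] + 1,            # insertion
                               previous[j] + 1,               # deletion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def single_linkage(words, max_distance):
    """Bottom-up clustering: start with one cluster per word and keep
    merging the two closest clusters until the closest pair is farther
    apart than max_distance (an assumed, user-chosen cut-off)."""
    clusters = [[w] for w in words]

    def cluster_distance(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(levenshtein(x, y) for x in c1 for y in c2)

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest distance (i < j).
        (i, j), best = min(
            (((i, j), cluster_distance(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > max_distance:
            break  # remaining clusters are too far apart to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


words = ["house", "house!!", "hooouse", "@house", "garden", "gardens", "gardn"]
clusters = single_linkage(words, max_distance=2)
for cluster in clusters:
    print(cluster)
```

On this small list the "house" variants end up in one group and the "garden" variants in another. Letting the loop run until a single cluster remains, and recording each merge, would give the full hierarchy described above; the threshold is just one way to cut that tree into flat groups.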
