Text clustering with Levenshtein distances

Question

I have a set of 2,000 to 4,000 short strings (3-6 characters each) that I want to cluster. Since I am working with strings, previous answers to "How does clustering (especially String clustering) work?" told me that Levenshtein distance is a suitable distance function. Also, since I do not know the number of clusters in advance, hierarchical clustering seems a better fit than k-means.

Although I understand the problem in its abstract form, I do not know the easiest way to actually do it. For example, is MATLAB or R the better choice for implementing hierarchical clustering with a custom distance function (Levenshtein distance)? For both, a Levenshtein distance implementation is easy to find. The clustering part seems harder. For example, "Clustering text in MATLAB" calculates the distance array for all strings, but I cannot understand how to use the distance array to actually get the clusters. Can any of you gurus show me how to implement hierarchical clustering in either MATLAB or R with a custom distance function?

Recommended answer

This may be a bit simplistic, but here's a code example that uses hierarchical clustering based on Levenshtein distance in R.

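set.seed(1)
# Return a vector of n random lowercase strings, each k characters long
rstr <- function(n, k) {
  sapply(1:n, function(i) paste0(sample(letters, k, replace = TRUE), collapse = ""))
}

# 30 strings of length 5 in three artificial groups, prefixed "aa", "bb", "cc"
strs <- c(paste0("aa", rstr(10, 3)), paste0("bb", rstr(10, 3)), paste0("cc", rstr(10, 3)))
# Levenshtein distance matrix (adist is in the utils package)
d <- adist(strs)
rownames(d) <- strs
# Hierarchical clustering; hclust uses complete linkage by default
hc <- hclust(as.dist(d))
plot(hc)                  # draw the dendrogram
rect.hclust(hc, k = 3)    # outline the three clusters on the plot
# Cluster id for each string
df <- data.frame(str = strs, cluster = cutree(hc, k = 3))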

In this example, we artificially create 30 random five-character strings in 3 groups (starting with "aa", "bb", and "cc"). We calculate the Levenshtein distance matrix using adist(...) and run hierarchical clustering using hclust(...). Then we cut the dendrogram into three clusters with cutree(...) and attach the cluster IDs to the original strings.
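
Since the question says the number of clusters is not known in advance, it may help that cutree(...) also accepts a height h in place of k. The sketch below is an illustration under assumptions, not part of the original answer: the words vector is made-up sample data, and the threshold of 2 edits is an arbitrary choice that would need tuning for real strings.

```r
# Hypothetical sample data: three groups whose members differ by one edit,
# while strings from different groups share no characters at all.
words <- c("aaxyz", "aaxyw", "aaxyq",
           "bbmno", "bbmnp",
           "ccrst", "ccrsu", "ccrsv")

d <- adist(words)            # Levenshtein distance matrix
rownames(d) <- words
hc <- hclust(as.dist(d))     # complete linkage (the default)

# Cut the tree at a height rather than asking for a fixed number of clusters.
# With complete linkage, every pair of strings inside a resulting cluster is
# at most h edits apart. The value 2 is an assumed threshold.
clusters <- cutree(hc, h = 2)

# List the strings in each cluster
split(words, clusters)
```

With this data the cut yields three groups, one per prefix. For real data, inspecting plot(hc) first is a reasonable way to pick a height where the dendrogram shows a clear gap.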
