Text clustering with Levenshtein distances


Question

I have a set of 2k to 4k short strings (3 to 6 characters each) that I want to cluster. Because the data are strings, earlier answers to "How does clustering (especially String clustering) work?" suggested that Levenshtein distance is a good choice of distance function. Also, since I do not know the number of clusters in advance, hierarchical clustering seems the way to go rather than k-means.

Although I understand the problem in its abstract form, I do not know the easiest way to actually do it. For example, is MATLAB or R the better choice for implementing hierarchical clustering with a custom distance function (Levenshtein distance)? A Levenshtein distance implementation is easy to find for both. The clustering part seems harder. For example, "Clustering text in MATLAB" calculates the distance array for all strings, but I cannot understand how to use that distance array to actually get the clustering. Can anyone show me how to implement hierarchical clustering with a custom distance function in either MATLAB or R?

Recommended answer

This may be a bit simplistic, but here is a code example that uses hierarchical clustering based on Levenshtein distance in R.

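set.seed(1)   # make the random strings reproducible
rstr <- function(n, k) {   # vector of n random strings of k lowercase letters
  sapply(seq_len(n), function(i) paste(sample(letters, k, replace = TRUE), collapse = ""))
}

# 30 strings of 5 characters in 3 artificial groups (prefixes "aa", "bb", "cc")
# (named strs rather than str to avoid masking base R's str() function)
strs <- c(paste0("aa", rstr(10, 3)), paste0("bb", rstr(10, 3)), paste0("cc", rstr(10, 3)))

# Levenshtein distance matrix (adist() is in the utils package)
d <- adist(strs)
rownames(d) <- strs

# Hierarchical clustering on the distances (complete linkage is the hclust default)
hc <- hclust(as.dist(d))
plot(hc)                  # dendrogram
rect.hclust(hc, k = 3)    # outline the 3 clusters on the plot

# Cut the tree into 3 clusters and attach the cluster ids to the strings
df <- data.frame(str = strs, cluster = cutree(hc, k = 3))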

In this example, we artificially create a set of 30 random 5-character strings in 3 groups (starting with "aa", "bb", and "cc"). We calculate the Levenshtein distance matrix using adist(...), and we run hierarchical clustering using hclust(...). Then we cut the dendrogram into three clusters with cutree(...) and attach the cluster IDs to the original strings.
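Since the question says the number of clusters is not known in advance, one possible variation is to cut the dendrogram at a height instead of asking for a fixed k. The sketch below is only an illustration, not part of the original answer: the six strings and the threshold h = 2 are made-up assumptions, and a sensible threshold for real data would have to be chosen by inspecting the dendrogram.

```r
# Hypothetical hand-picked strings (not from the original post)
words <- c("aaxyz", "aaxyw", "aaqyz", "bbxyz", "bbxyw", "ccmno")

# Pairwise Levenshtein distances; adist() returns a symmetric matrix
d <- adist(words)
rownames(d) <- colnames(d) <- words

# Complete linkage (the hclust default): the height of a merge is the
# largest pairwise distance between the two clusters being joined
hc <- hclust(as.dist(d))

# Cut by height rather than by k: with complete linkage, strings that end
# up in the same cluster are at most h edits apart (h = 2 is an assumption)
clusters <- cutree(hc, h = 2)

# List the strings in each cluster
print(split(words, clusters))
```

With cutree(hc, h = ...), the number of clusters falls out of the chosen distance threshold, which may be easier to reason about for short strings than guessing k up front.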
