Clustering in R with Levenshtein distance


Question

I am trying to use k-means clustering with the Levenshtein distance. I am having a hard time interpreting the results.

    # Courtesy: this code is borrowed from the other thread listed below,
    # with k-means clustering added.
    set.seed(1)

    # Return a vector of n random strings, each k lowercase letters long.
    rstr <- function(n, k) {
      sapply(1:n, function(i) {
        do.call(paste0, as.list(sample(letters, k, replace = TRUE)))
      })
    }

    str <- c(paste0("aa", rstr(10, 3)),
             paste0("bb", rstr(10, 3)),
             paste0("cc", rstr(10, 3)))

    # Levenshtein distance matrix
    d <- adist(str)
    rownames(d) <- str
    hc <- hclust(as.dist(d))
    plot(hc)

    # Normalize the distances, for when the sequences have unequal lengths.
    # (Named max_d so that it does not shadow base::max.)
    max_d <- max(d)
    data <- d / max_d

    k.means.fit <- kmeans(data, 3)

    library(cluster)  # provides clusplot()
    clusplot(d, k.means.fit$cluster, main = 'Clustering',
             color = TRUE, shade = TRUE,
             labels = 5, lines = 0, col.p = "dark green")

So, what does the cluster plot show, and how can I interpret it? I referred to another thread, where they discuss that the data are plotted on two principal components: https://stats.stackexchange.com/questions/274754/how-to-interpret-the-clusplot-in-r

But it was not clear to me how to explain the figure, or why those points are in that ellipse/cluster. Any ideas? Thanks!

Recommended answer

This is pretty straightforward. You constructed your strings to fall into three groups: ten strings that start with 'aa', ten with 'bb' and ten with 'cc'. After those two letters, the rest of each string is random. Under the Levenshtein distance, you would expect strings that share the same first two letters to be close to each other. When you look at the plot of the hierarchical clustering, it is easy to see three main groups defined by the first two letters of the strings. When you use kmeans with k=3, you get the same clusters. You can see this by checking the cluster assignments:

 k.means.fit$cluster
aagjo aaxfx aayrq aabfe aarju aamsz aajuy aafqd aagka aajwi bbmpm bbevr bbucs 
    1     1     1     1     1     1     1     1     1     1     3     3     3 
bbkvq bbuon bbuam bbtsm bbwlg bbbci bbnrk ccxhl cciqg ccmtc ccwiv ccjim ccxwk 
    3     3     3     3     3     3     3     2     2     2     2     2     2 
ccuyl ccski cctfs ccdgd 
    2     2     2     2 

Cluster 1 contains the strings that start with 'aa', cluster 2 those that start with 'cc', and cluster 3 those that start with 'bb'.
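The correspondence between clusters and prefixes can also be checked without reading the assignment vector by eye. The following is only a sketch, not part of the original answer: it rebuilds the example data, cross-tabulates the two-letter prefix against the k-means cluster, and then draws a rough two-dimensional view by running PCA on the rows of the distance matrix, in the spirit of the principal-components projection mentioned in the question. The names `strings`, `prefix`, `tab` and `scores`, the `nstart = 10` argument, and the use of `prcomp` are all assumptions made for illustration; the plot will not match `clusplot` exactly, and the random strings (and cluster numbering) can differ between R versions.

```r
# Rebuild the example data with the same kind of generator as in the question.
set.seed(1)
rstr <- function(n, k) {
  sapply(seq_len(n), function(i) {
    paste0(sample(letters, k, replace = TRUE), collapse = "")
  })
}
strings <- c(paste0("aa", rstr(10, 3)),
             paste0("bb", rstr(10, 3)),
             paste0("cc", rstr(10, 3)))

# Levenshtein distance matrix, scaled by its maximum as in the question.
d <- adist(strings)
rownames(d) <- strings
scaled <- d / max(d)

# nstart = 10 is an addition (assumption) to reduce dependence on the
# random starting centers; the original code used a single start.
km <- kmeans(scaled, centers = 3, nstart = 10)

# Cross-tabulate the two-letter prefix against the assigned cluster.
# If k-means recovers the groups, each row has a single non-zero count.
prefix <- substr(strings, 1, 2)
tab <- table(prefix = prefix, cluster = km$cluster)
print(tab)

# Rough 2-D view: treat each row of d as a feature vector, run PCA,
# and plot the first two component scores colored by cluster.
scores <- prcomp(d)$x[, 1:2]
plot(scores, col = km$cluster, pch = 19, xlab = "PC1", ylab = "PC2")
text(scores, labels = strings, pos = 3, cex = 0.6)
```

In the scatter plot, points that land close together have similar rows in the distance matrix, that is, similar distances to all the other strings; this is why strings with the same prefix tend to fall in the same region. Note that `kmeans` here also works on the rows of the matrix as feature vectors, not on the Levenshtein distances directly.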
