Clustering in R with Levenshtein distance
Question
I am trying to use k-means clustering with the Levenshtein distance, and I am having a hard time interpreting the results.
# courtesy: code is borrowed from the other thread listed below with some additions of k-means clustering
set.seed(1)
rstr <- function(n, k) { # vector of n random char(k) strings
  sapply(1:n, function(i) { do.call(paste0, as.list(sample(letters, k, replace = TRUE))) })
}
str <- c(paste0("aa", rstr(10, 3)), paste0("bb", rstr(10, 3)), paste0("cc", rstr(10, 3)))
# Levenshtein Distance
d <- adist(str)
rownames(d) <- str
hc <- hclust(as.dist(d))
plot(hc)
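# normalize the distances by the largest one, for when the sequences have unequal lengths
max_d <- max(d)
data <- d / max_d
# note: kmeans treats each row of the distance matrix as a feature vector
k.means.fit <- kmeans(data, 3)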
library(cluster)
clusplot(d, k.means.fit$cluster, main = 'Clustering',
         color = TRUE, shade = TRUE,
         labels = 5, lines = 0, col.p = "dark green")
So, what does the cluster plot show, and how can I interpret it? I looked at other threads, which explain that the points are plotted on the first two principal components. https://stats.stackexchange.com/questions/274754/how-to-interpret-the-clusplot-in-r
But it is still not clear to me how to explain the figure, or why those points fall in that ellipse/cluster. Any ideas? Thanks!
Recommended answer
This is pretty straightforward. You constructed your strings to fall into three groups: ten strings that start with 'aa', ten with 'bb' and ten with 'cc'. After those prefixes, the rest of each string is random. Using Levenshtein distance, you would expect strings that share the same first two letters to be close to each other. When you look at the plot of the hierarchical clustering, it is easy to see three main groups defined by the first two letters of the strings. When you use kmeans with k=3, you get the same clusters. You can see this by checking the cluster assignments:
k.means.fit$cluster
aagjo aaxfx aayrq aabfe aarju aamsz aajuy aafqd aagka aajwi bbmpm bbevr bbucs
    1     1     1     1     1     1     1     1     1     1     3     3     3
bbkvq bbuon bbuam bbtsm bbwlg bbbci bbnrk ccxhl cciqg ccmtc ccwiv ccjim ccxwk
    3     3     3     3     3     3     3     2     2     2     2     2     2
ccuyl ccski cctfs ccdgd
    2     2     2     2
Cluster 1 contains the strings that start with 'aa', cluster 2 those that start with 'cc', and cluster 3 those that start with 'bb'.