R: Cluster analysis with hclust(). How to get the cluster representatives?


Question

I am doing some cluster analysis with R. I am using the hclust() function and, after performing the cluster analysis, I would like to get the cluster representative of each cluster.

I define a cluster representative as the instance that is closest to the centroid of its cluster.

So, the steps are:


  1. Find the centroid of each cluster

  2. Find the representative of each cluster

I have already asked a similar question, but using K-means: https://stats.stackexchange.com/questions/251987/cluster-analysis-with-k-means-how-to-get-the-cluster-representatives

The problem, in this case, is that hclust doesn't give the centroids!

For example, if d is my data, what I have done so far is:

hclust.fit1 <- hclust(d, method="single")     
groups1 <- cutree(hclust.fit1, k=3) # cut tree into 3 clusters

## getting centroids ##

mycentroid <- colMeans(CV)    
clust.centroid = function(i, dat, groups1) {    
  ind = (groups1 == i)   
  colMeans(dat[ind,])
}

centroids <- sapply(unique(groups1), clust.centroid, data, groups1)

But now I am trying to get the cluster representatives with this code (I got it from the other question I asked, for k-means):

index <- c()

for (i in 1:3){    
  rowsum <- rowSums(abs(CV[which(centroids==i),1:3] - centroids[i,]))    
  index[i] <- as.numeric(names(which.min(rowsum)))   
}

It says:

"Error in e2[[j]] : index out of the limit"

I would be grateful if any of you could give me a little help. Thanks.

- (not) Working example of the code -

example_data.txt

A,B,C
10.761719,5.452188,7.575762
10.830457,5.158822,7.661588
10.75391,5.500170,7.740330
10.686719,5.286823,7.748297
10.864527,4.883244,7.628730
10.701415,5.345650,7.576218
10.820583,5.151544,7.707404
10.877528,4.786888,7.858234
10.712337,4.744053,7.796390

As for the code:

# Install R packages

#install.packages("fpc")

#install.packages("cluster")

#install.packages("rgl")

library(fpc)
library(cluster)
library(rgl)

CV <- read.csv("example_data.txt")

str(CV)

data <- scale(CV)

d <- dist(data,method = "euclidean")
hclust.fit1 <- hclust(d, method="single") 
groups1 <- cutree(hclust.fit1, k=3) # cut tree into 3 clusters
mycentroid <- colMeans(CV)

clust.centroid = function(i, dat, groups1) {
  ind = (groups1 == i)
  colMeans(dat[ind,])
}

centroids <- sapply(unique(groups1), clust.centroid, CV, groups1)

index <- c()
for (i in 1:3){
  rowsum <- rowSums(abs(CV[which(centroids==i),1:3] - centroids[i,]))
  index[i] <- as.numeric(names(which.min(rowsum)))
}


Recommended answer

Hierarchical clustering does not use (or compute) representatives.

In particular for single link (but it can also happen for other linkages), the "center" can be in a different cluster. Just consider the top two data sets in the example:

Furthermore, the centroid (mean) is tied to Euclidean distance. With other distances, it may be a very bad representative.

So use it with care!
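One way to sidestep the dependence on Euclidean distance is to pick a medoid instead of the instance nearest the mean: the cluster member with the smallest total dissimilarity to the other members. This is a different technique from the one asked about, not something hclust() provides. The sketch below is only a minimal illustration; the toy data, the Manhattan distance and all variable names are assumptions made for the example.

```r
set.seed(1)                               # toy data, purely illustrative
x <- matrix(rnorm(40), ncol = 2)          # 20 instances, 2 variables
d <- dist(x, method = "manhattan")        # any dissimilarity could be used
groups <- cutree(hclust(d, method = "single"), k = 3)

dm <- as.matrix(d)                        # full symmetric distance matrix
labels <- sort(unique(groups))

# For each cluster, keep the member whose summed distance to the other
# members of the same cluster is smallest (the medoid).
medoids <- sapply(labels, function(k) {
  members <- which(groups == k)
  within <- dm[members, members, drop = FALSE]
  members[which.min(rowSums(within))]
})

x[medoids, , drop = FALSE]                # rows chosen as representatives
```

Because only the distance matrix is used, the same code works unchanged for whatever dissimilarity was passed to hclust().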

Either way, hierarchical clustering does not define or compute a representative. You will have to do this yourself.
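Doing it yourself, with the definition from the question (the instance closest to the cluster mean), could look roughly like the sketch below. It assumes Euclidean distance on the scaled data. The helper name cluster_representative is hypothetical, and the data frame simply inlines the example_data.txt values from the question so the code runs on its own.

```r
# Example data from the question, inlined so the sketch is self-contained.
CV <- data.frame(
  A = c(10.761719, 10.830457, 10.75391, 10.686719, 10.864527,
        10.701415, 10.820583, 10.877528, 10.712337),
  B = c(5.452188, 5.158822, 5.500170, 5.286823, 4.883244,
        5.345650, 5.151544, 4.786888, 4.744053),
  C = c(7.575762, 7.661588, 7.740330, 7.748297, 7.628730,
        7.576218, 7.707404, 7.858234, 7.796390)
)

dat <- scale(CV)                          # numeric matrix, one row per instance
hc <- hclust(dist(dat, method = "euclidean"), method = "single")
groups <- cutree(hc, k = 3)               # cluster label for every row

# Hypothetical helper: row index of the member nearest to its cluster mean.
cluster_representative <- function(k, dat, groups) {
  members <- which(groups == k)
  block <- dat[members, , drop = FALSE]   # stays a matrix even for one member
  centroid <- colMeans(block)
  sq_dist <- rowSums(sweep(block, 2, centroid)^2)  # squared Euclidean distance
  members[which.min(sq_dist)]
}

labels <- sort(unique(groups))
representatives <- sapply(labels, cluster_representative,
                          dat = dat, groups = groups)
names(representatives) <- labels
CV[representatives, ]                     # original rows of the representatives
```

Two details differ from the code in the question: members are selected with groups == k rather than by comparing against the centroid matrix, and drop = FALSE keeps single-member clusters (common with single linkage) from collapsing to a vector, which would break colMeans().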
