Select the most dissimilar individual using cluster analysis
Problem description
I want to cluster my data into, say, 5 clusters, and then select the 50 most dissimilar individuals from the whole data set, in proportion to cluster size. For example, if cluster one contains 100 individuals, cluster two 200, cluster three 400, cluster four 200, and cluster five 100, I have to select 5 from the first cluster + 10 from the second + 20 from the third + 10 from the fourth + 5 from the fifth.
Example data:
mydata <- matrix(rnorm(1000, mean = 0, sd = 1), nrow = 100, ncol = 10)
What I have done so far is cluster the data, rank the individuals within each cluster, export the result to Excel, and continue from there. That has become a problem now that my data has grown really big.
I would appreciate any help or suggestions on how to do the above in R.
Recommended answer
Regarding your comment, see the code below.
Please note that the code could be improved in terms of both elegance and efficiency. I also posted this as a second answer, because otherwise it would have been too messy.
# calculation of centroids based on:
# https://stat.ethz.ch/pipermail/r-help/2006-May/105328.html
cl <- hclust(dist(mydata, diag = TRUE, upper = TRUE))
cent <- tapply(mydata,
               list(rep(cutree(cl, 5), ncol(mydata)), col(mydata)), mean)
dimnames(cent) <- list(NULL, dimnames(mydata)[[2]])
# attach the cluster number to the data and split by cluster
newdf <- data.frame(data = mydata, cluster = cutree(cl, k = 5))
newdfl <- split(newdf, f = newdf$cluster)
# prepend each centroid to its cluster and drop the cluster column (the last one)
totaldf <- lapply(1:5,
                  function(i, li, cen) rbind(cen[i, ], li[[i]][, -ncol(li[[i]])]),
                  li = newdfl, cen = cent)
# distances from the centroid (row 1) to every member, sorted ascending;
# the leading 0 in each vector is the centroid's distance to itself
dist_to_cent <- lapply(totaldf, function(x)
  sort(as.matrix(dist(x, diag = TRUE, upper = TRUE))[1, ]))
dist_to_cent
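The code above stops at the sorted distances. The proportional selection asked for in the question could be sketched as follows. This is only one possible reading: it assumes that "most dissimilar" means "farthest from the centroid of its own cluster", and the names n_total, quota, and selected are made up for this illustration. Because the example data has only 100 rows, it picks 10 individuals instead of 50.

```r
set.seed(1)  # only to make the sketch reproducible
mydata  <- matrix(rnorm(1000), nrow = 100, ncol = 10)
cluster <- cutree(hclust(dist(mydata)), k = 5)

n_total <- 10  # assumption: total number of individuals to pick

# proportional quota per cluster (largest-remainder rounding so the
# quotas always add up to n_total)
share <- n_total * as.vector(table(cluster)) / length(cluster)
quota <- floor(share)
extra <- n_total - sum(quota)
if (extra > 0) {
  top <- order(share - quota, decreasing = TRUE)[seq_len(extra)]
  quota[top] <- quota[top] + 1
}

# within each cluster, keep the quota[i] rows farthest from the centroid
selected <- unlist(lapply(seq_along(quota), function(i) {
  members  <- which(cluster == i)
  block    <- mydata[members, , drop = FALSE]
  centroid <- colMeans(block)
  d <- sqrt(rowSums(sweep(block, 2, centroid)^2))  # Euclidean distance
  members[order(d, decreasing = TRUE)][seq_len(quota[i])]
}))

selected  # row indices of the chosen individuals in mydata
```

selected holds row numbers of mydata, so mydata[selected, ] gives the chosen individuals without a detour through Excel. If "most dissimilar" should instead mean far from each other rather than far from the centroid, the ranking step would need a different criterion.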
For calculating centroids from an hclust result, see the R mailing list post referenced in the code above.
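To make the tapply call a little less opaque, here is a small sketch that compares it with a more direct computation of the same centroids. It does not come from the mailing list post; cent_tapply and cent_rowsum are names chosen for this illustration only.

```r
set.seed(1)  # only to make the sketch reproducible
mydata  <- matrix(rnorm(1000), nrow = 100, ncol = 10)
cluster <- cutree(hclust(dist(mydata)), k = 5)

# tapply form: group every cell of the matrix by (cluster of its row, its column)
cent_tapply <- tapply(mydata,
                      list(rep(cluster, ncol(mydata)), col(mydata)), mean)

# direct form: per-cluster column sums divided by the cluster sizes
cent_rowsum <- rowsum(mydata, cluster) / as.vector(table(cluster))

# both are 5 x 10 matrices of cluster means; the difference is numerical noise
max(abs(cent_tapply - cent_rowsum))
```

rep(cluster, ncol(mydata)) works because R stores matrices column by column, so repeating the cluster vector once per column lines each cell up with the cluster of its row.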