Select the most dissimilar individuals using cluster analysis


Question

I want to cluster my data into, say, 5 clusters, and then select the 50 most dissimilar individuals from all the data, in proportion to cluster size. That means if cluster one contains 100 individuals, cluster two 200, cluster three 400, cluster four 200, and cluster five 100, I have to select 5 from the first cluster + 10 from the second + 20 from the third + 10 from the fourth + 5 from the fifth.
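The allocation described above is proportional to cluster size. A minimal sketch of the arithmetic in R, using the cluster sizes from the example (the variable names are illustrative and not from the original post):

```r
# Cluster sizes taken from the example in the question
sizes <- c(100, 200, 400, 200, 100)
n_total <- 50  # total number of individuals to select

# Proportional quota per cluster; rounding may not always sum exactly
# to n_total for other size combinations
quota <- round(n_total * sizes / sum(sizes))
quota
```

For these sizes the quotas come out as whole numbers, so no rounding adjustment is needed.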

Example data:

     mydata<-matrix(nrow=100,ncol=10,rnorm(1000, mean = 0, sd = 1))

What I have done so far is cluster the data and rank the individuals within each cluster, then export the result to Excel and go from there. That has become a problem since my data has become really big.

I would appreciate any help or suggestions on how to do the above in R.

Recommended answer

Regarding your comment, find the code below:

Please note that the code can be improved in terms of beauty and efficiency. Further, I used a second answer, because otherwise it would be too messy.

# calculation of centroids based on:
# https://stat.ethz.ch/pipermail/r-help/2006-May/105328.html
cl <- hclust(dist(mydata, diag = TRUE, upper=TRUE))
cent <- tapply(mydata, 
        list(rep(cutree(cl, 5), ncol(mydata)), col(mydata)), mean)
dimnames(cent) <- list(NULL, dimnames(mydata)[[2]])


# add up cluster number and data and split by cluster
newdf <- data.frame(data=mydata, cluster=cutree(cl, k=5))
newdfl <- split(newdf, f=newdf$cluster)

# prepend each cluster's centroid as row 1 and drop the cluster column (column 11)
totaldf <- lapply(1:5, 
           function(i, li, cen) rbind(cen[i, ], li[[i]][ , -11]),
                                li=newdfl, cen=cent)


# calculate the distance of each individual to its centroid (row 1) and sort ascending
dist_to_cent <- lapply(totaldf, function(x)
                   sort(as.matrix(dist(x, diag=TRUE, upper=TRUE))[1, ]))
dist_to_cent

For the calculation of centroids from an hclust result, see the R-help mailing list post linked in the code comments.
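To get from the sorted distances to an actual selection, one possible sketch is shown here. It assumes that "most dissimilar" means farthest from the centroid of one's own cluster, which the question does not state explicitly. The seed, `n_total`, and the names `grp`, `d`, `quota`, and `selected` are illustrative assumptions, not part of the original answer.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
mydata <- matrix(rnorm(1000), nrow = 100, ncol = 10)

k <- 5         # number of clusters
n_total <- 20  # assumption: total number of individuals to select

cl  <- hclust(dist(mydata))
grp <- cutree(cl, k = k)

# Centroid of each cluster (k x ncol matrix): column sums per group
# divided by the group size
cent <- rowsum(mydata, grp) / as.vector(table(grp))

# Euclidean distance of every row to the centroid of its own cluster
d <- sqrt(rowSums((mydata - cent[grp, , drop = FALSE])^2))

# Quota per cluster, proportional to cluster size
# (after rounding, the quotas may not sum exactly to n_total)
quota <- as.integer(round(n_total * table(grp) / length(grp)))

# Within each cluster, keep the rows farthest from the centroid
selected <- unlist(lapply(seq_len(k), function(i) {
  idx <- which(grp == i)
  idx[order(d[idx], decreasing = TRUE)][seq_len(min(quota[i], length(idx)))]
}))

selected  # row indices of the chosen individuals
```

Setting `decreasing = FALSE` would instead pick the individuals closest to each centroid. Because everything stays in R, no export to Excel is needed, and the same steps apply to larger data as long as the distance matrix fits in memory.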
