R cluster analysis and dendrogram with correlation matrix

Question

I have to perform a cluster analysis on a large amount of data. Because I have a lot of missing values, I built a correlation matrix:

corloads = cor(df1[,2:185], use = "pairwise.complete.obs")

Now I am not sure how to go on. I have read a lot of articles and examples, but nothing really works for me. How can I find out how many clusters are appropriate for my data?

I have already tried this:

dissimilarity = 1 - corloads
distance = as.dist(dissimilarity) 

plot(hclust(distance), main="Dissimilarity = 1 - Correlation", xlab="") 

I got a plot, but it is very messy, and I don't know how to read it or how to continue. It looks like this:

[Image: dendrogram produced by the code above]

Any idea how to improve it? And what can I actually get out of it?

I also wanted to create a scree plot. I read that it shows a curve from which you can see how many clusters are appropriate.

I also ran a cluster analysis with 2 to 20 clusters, but the output is so long that I have no idea how to handle it or which parts are important to look at.

Recommended answer

Several methods are available for determining the "optimal number of clusters", although the topic is controversial.

The kgs function (from the maptree package) is helpful for finding the optimal number of clusters.

Following your code, one would do:

library(maptree)  # provides kgs()

clus <- hclust(distance)
op_k <- kgs(clus, distance, maxclust = 20)  # penalty for each number of clusters
plot(as.integer(names(op_k)), op_k, xlab = "# clusters", ylab = "penalty")

So, according to the kgs function, the optimal number of clusters is the one with the smallest penalty in op_k, as you can see in the plot. The smallest penalty itself is given by

min(op_k)

Note that I set the maximum number of clusters allowed to 20. You can also set this argument to NULL.

See the linked page for more methods.
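For a quick look without extra packages, a scree-style plot of the merge heights stored by hclust is one simple heuristic. This is a different technique from kgs, not a replacement for it. The sketch below is only an illustration under assumptions: the built-in mtcars data stands in for df1, and k = 3 is chosen arbitrarily. Note that cor() on a data frame correlates columns, so this clusters the variables, not the rows.

```r
# Stand-in data (assumption): mtcars ships with R; replace with df1[, 2:185].
cors <- cor(mtcars, use = "pairwise.complete.obs")
tree <- hclust(as.dist(1 - cors))   # complete linkage by default

# Merge heights, largest first: position i is the height of the merge
# that reduces i + 1 clusters to i clusters.
heights <- rev(tree$height)
plot(seq_along(heights), heights, type = "b",
     xlab = "# clusters after merge", ylab = "merge height")

# Cut the tree once a cluster count has been chosen (k = 3 is arbitrary here).
groups <- cutree(tree, k = 3)
plot(tree, main = "Dissimilarity = 1 - Correlation", xlab = "", sub = "")
rect.hclust(tree, k = 3, border = "red")   # outline the 3 clusters
split(names(groups), groups)               # variables in each cluster
```

A sharp drop after position i in the first plot suggests stopping at i + 1 clusters, because the next merge would join groups that are far apart. With many variables, the list returned by split() is usually easier to read than the dendrogram itself.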

Hope it helps.

To find which number of clusters is optimal, you can do

op_k[which(op_k == min(op_k))]

Also see this post for an excellent graphical answer from @Ben.

However,

op_k[which(op_k == min(op_k))]

still returns the penalty value. To get the optimal number of clusters itself, use

as.integer(names(op_k[which(op_k == min(op_k))]))
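The same lookup can be written more compactly with which.min, which returns the position and name of the first minimum. The penalty values below are made up purely for illustration. In practice op_k would come from kgs, whose result is named by cluster count (as the plot call in the answer relies on).

```r
# Hypothetical penalties (not real results), named by cluster count.
op_k <- c("2" = 1.90, "3" = 1.42, "4" = 1.15, "5" = 1.31, "6" = 1.58)

best <- which.min(op_k)            # named index of the smallest penalty
best_k <- as.integer(names(best))  # the cluster count itself
best_k                             # 4
```

Unlike the which(op_k == min(op_k)) form, which.min returns only the first minimum if several cluster counts tie. The resulting count can then be passed to cutree(clus, k = best_k) to obtain the cluster assignments.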
