R cluster analysis and dendrogram with correlation matrix

Question

I have to perform a cluster analysis on a large amount of data. Since I have a lot of missing values, I computed a correlation matrix.

corloads = cor(df1[,2:185], use = "pairwise.complete.obs")

Now I am not sure how to proceed. I have read a lot of articles and examples, but nothing really works for me. How can I find out how many clusters are appropriate for my data?

I have already tried:

dissimilarity = 1 - corloads
distance = as.dist(dissimilarity) 

plot(hclust(distance), main="Dissimilarity = 1 - Correlation", xlab="") 

I got a plot, but it's very messy and I don't know how to read it or how to go on. It looks like this:

Any idea how to improve it? And what can I actually get out of it?

I also wanted to create a scree plot. I read that it shows a curve from which you can see how many clusters are appropriate.

I also ran a cluster analysis for 2 to 20 clusters, but the output is so long that I have no idea how to handle it or which parts are important to look at.

Recommended answer

Several methods are available for determining the "optimal number of clusters", although this is a controversial topic.

The kgs function (from the maptree package) is helpful for finding the optimal number of clusters.

Following on from your code, one would do:

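library(maptree)  # provides kgs()
clus <- hclust(distance)
# Kelley-Gardner-Sutcliffe penalty for each candidate number of clusters
op_k <- kgs(clus, distance, maxclust = 20)
plot(names(op_k), op_k, xlab = "# clusters", ylab = "penalty")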

So the optimal number of clusters according to the kgs function is the one with the minimum penalty in op_k, as you can see in the plot. You can get the minimum penalty with

min(op_k)

Note that I set the maximum number of clusters allowed to 20. You can also set this argument to NULL.

Check this page for more methods.
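One such alternative, close to the scree plot mentioned in the question, is to plot the merge heights of the hierarchical clustering against the number of clusters and look for the largest drop. The following is only a rough sketch using base R: the data are synthetic (15 variables built from 3 latent factors, with some values removed), and the names x, latent, heights_desc and k_elbow are made up for illustration. The "largest drop" rule is a simple heuristic, not a substitute for kgs.

```r
# Synthetic stand-in for the question's data (all names here are illustrative):
# 15 variables driven by 3 latent factors, with about 5% of values missing.
set.seed(42)
n <- 200
latent <- matrix(rnorm(n * 3), ncol = 3)
x <- latent[, rep(1:3, each = 5)] + matrix(rnorm(n * 15, sd = 0.5), ncol = 15)
colnames(x) <- paste0("v", 1:15)
x[sample(length(x), 150)] <- NA

# Same steps as in the question
corloads <- cor(x, use = "pairwise.complete.obs")
distance <- as.dist(1 - corloads)
clus <- hclust(distance)

# Merge heights from the last merge backwards: element i is the height at
# which i + 1 clusters are joined into i clusters.
heights_desc <- rev(clus$height)
n_clusters <- seq_along(heights_desc) + 1

plot(n_clusters, heights_desc, type = "b",
     xlab = "# clusters", ylab = "merge height")

# Heuristic: pick the number of clusters just before the largest drop
drops <- -diff(heights_desc)
k_elbow <- n_clusters[which.max(drops)]
k_elbow
```

On real data the curve is usually less clear-cut than on this synthetic example, so it is worth comparing the result with the kgs penalty rather than relying on either alone.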

Hope it helps.

To find which number of clusters is optimal, you can do

op_k[which(op_k == min(op_k))]

Plus

Also see this post for a great graphical answer from @Ben.

op_k[which(op_k == min(op_k))]

still returns the penalty. To get the optimal number of clusters, use

as.integer(names(op_k[which(op_k == min(op_k))]))
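Once a number of clusters has been chosen, it can be used to make the dendrogram easier to read. The sketch below is one possible way to do this with base R only (cutree and rect.hclust); it uses the built-in mtcars data as a stand-in for df1, and k <- 3 is just a placeholder for the value obtained from op_k above.

```r
# Stand-in data: correlations between the columns of the built-in mtcars set
corloads <- cor(mtcars, use = "pairwise.complete.obs")
clus <- hclust(as.dist(1 - corloads))

# Placeholder value; in practice use the number of clusters found via op_k
k <- 3

# Cluster membership for every variable
groups <- cutree(clus, k = k)

# Dendrogram with a box drawn around each of the k clusters
plot(clus, main = "Dissimilarity = 1 - Correlation", xlab = "", sub = "")
rect.hclust(clus, k = k, border = "red")

# Variables listed per cluster, usually easier to read than the raw output
split(names(groups), groups)
```

The list returned by split() gives the variable names in each cluster, which may be a more manageable summary than the full clustering output for 2 to 20 clusters.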
