How to apply a hierarchical or k-means cluster analysis using R?


Question


I want to apply a hierarchical cluster analysis with R. I am aware of the hclust() function but not how to use this in practice; I'm stuck with supplying the data to the function and processing the output.


I would also like to compare the hierarchical clustering with that produced by kmeans(). Again I am not sure how to call this function or use/manipulate the output from it.

我的数据类似于:

## dummy data
require(MASS)
set.seed(1)
dat <- data.frame(mvrnorm(100, mu = c(2,6,3), 
                          Sigma = matrix(c(10,   2,   4,
                                            2,   3, 0.5,
                                            4, 0.5,   2), ncol = 3)))


Answer


For hierarchical cluster analysis take a good look at ?hclust and run its examples. Alternative functions are in the cluster package that comes with R. k-means clustering is available in function kmeans() and also in the cluster package.
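As a brief sketch of those alternatives (using the built-in iris data purely for illustration), agnes() is the cluster package's agglomerative hierarchical routine and pam() is a k-medoids counterpart to kmeans():

```r
library(cluster)

## any numeric matrix will do; iris is used only as a handy example
X <- scale(iris[, 1:4])

ag <- agnes(X, method = "average")  # hierarchical, analogous to hclust()
pm <- pam(X, k = 3)                 # partitioning, analogous to kmeans()
pm$clustering                       # membership vector, like kmeans()$cluster
```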


A simple hierarchical cluster analysis of the dummy data you show would be done as follows:

## dummy data first
require(MASS)
set.seed(1)
dat <- data.frame(mvrnorm(100, mu = c(2,6,3), 
                          Sigma = matrix(c(10,   2,   4,
                                            2,   3, 0.5,
                                            4, 0.5,   2), ncol = 3)))


Compute the dissimilarity matrix using Euclidean distances (you can use whatever distance you want)

dij <- dist(scale(dat, center = TRUE, scale = TRUE))
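For instance, continuing with the same standardised data, a city-block (Manhattan) dissimilarity could be computed instead; the choice of metric is entirely up to you:

```r
## Manhattan (city-block) distances on the standardised data
dij_man <- dist(scale(dat, center = TRUE, scale = TRUE), method = "manhattan")
clust_man <- hclust(dij_man, method = "average")  # cluster on the new metric
```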


Then cluster them, say using the group average hierarchical method

clust <- hclust(dij, method = "average")

Printing the result gives:

R> clust

Call:
hclust(d = dij, method = "average")

Cluster method   : average 
Distance         : euclidean 
Number of objects: 100


but that simple output belies a complex object that needs further functions to extract or use the information contained therein:

R> str(clust)
List of 7
 $ merge      : int [1:99, 1:2] -12 -17 -40 -30 -73 -23 1 -52 -91 -45 ...
 $ height     : num [1:99] 0.0451 0.0807 0.12 0.1233 0.1445 ...
 $ order      : int [1:100] 84 14 24 67 46 34 49 36 41 52 ...
 $ labels     : NULL
 $ method     : chr "average"
 $ call       : language hclust(d = dij, method = "average")
 $ dist.method: chr "euclidean"
 - attr(*, "class")= chr "hclust"


The dendrogram can be generated using the plot() method (hang gets the labels at the bottom of the dendrogram, along the x-axis, and cex just shrinks all the labels to 70% of normal)

plot(clust, hang = -0.01, cex = 0.7)


Say we want a 3-cluster solution, cut the dendrogram to produce 3 groups and return the cluster memberships

R> cutree(clust, k = 3)
  [1] 1 2 1 2 2 3 2 2 2 3 2 2 3 1 2 2 2 2 2 2 2 2 2 1 2 3 2 1 1 2 2 2 2 1 1 1 1
 [38] 2 2 2 1 3 2 2 1 1 3 2 1 2 2 1 2 1 2 2 3 1 2 3 2 2 2 3 1 3 1 2 2 2 3 1 2 1
 [75] 1 2 3 3 3 3 1 3 2 1 2 2 2 1 2 2 1 2 2 2 2 2 3 1 1 1


That is, cutree() returns a vector the same length as the number of observations clustered, the elements of which contain the group ID that each observation belongs to. The membership is the ID of the leaf into which each observation falls when the dendrogram is cut at a stated height or, as done here, at the appropriate height to provide the stated number of groups.
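For example, continuing with the clust object from above, you can cut at a stated height on the dendrogram rather than asking for a set number of groups (the height 1.5 here is just an illustrative value, not anything special about these data):

```r
## cut the tree at an arbitrary example height instead of a group count
grps_h <- cutree(clust, h = 1.5)
table(grps_h)  # how many observations fall in each resulting group
```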


Perhaps that gives you enough to be going on with?


For k-means, we would do this

set.seed(2) ## *k*-means uses a random start
klust <- kmeans(scale(dat, center = TRUE, scale = TRUE), centers = 3)
klust

which gives:

> klust
K-means clustering with 3 clusters of sizes 41, 27, 32

Cluster means:
           X1          X2          X3
1  0.04467551  0.69925741 -0.02678733
2  1.11018549 -0.01169576  1.16870206
3 -0.99395950 -0.88605526 -0.95177110

Clustering vector:
  [1] 3 1 3 2 2 3 1 1 1 1 2 1 1 3 2 3 1 2 1 2 2 1 1 3 2 1 1 3 3 1 2 2 1 3 3 3 3
 [38] 1 2 2 3 1 2 2 3 3 1 2 3 2 1 3 1 3 2 2 1 3 2 1 2 1 1 1 3 1 3 2 1 2 1 3 1 3
 [75] 3 1 1 1 1 1 3 1 2 3 1 1 1 3 1 1 3 2 2 1 2 2 3 3 3 3

Within cluster sum of squares by cluster:
[1] 47.27597 31.52213 42.15803
 (between_SS / total_SS =  59.3 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"


Here we get some information about the components in the object returned by kmeans(). The $cluster component will yield the membership vector, comparable to the output we saw earlier from cutree():

R> klust$cluster
  [1] 3 1 3 2 2 3 1 1 1 1 2 1 1 3 2 3 1 2 1 2 2 1 1 3 2 1 1 3 3 1 2 2 1 3 3 3 3
 [38] 1 2 2 3 1 2 2 3 3 1 2 3 2 1 3 1 3 2 2 1 3 2 1 2 1 1 1 3 1 3 2 1 2 1 3 1 3
 [75] 3 1 1 1 1 1 3 1 2 3 1 1 1 3 1 1 3 2 2 1 2 2 3 3 3 3
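A quick way to compare the two membership vectors is a cross-tabulation. Note that the numeric group labels are arbitrary in both methods, so agreement shows up as one dominant count per row, not necessarily along the diagonal:

```r
## cross-tabulate hierarchical vs k-means memberships
table(hclust = cutree(clust, k = 3), kmeans = klust$cluster)
```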


In both instances, notice that I also scale (standardise) the data to allow each variable to be compared on a common scale. With data measured in different "units" or on different scales (as here with different means and variances) this is an important data processing step if the results are to be meaningful or not dominated by the variables that have large variances.
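To see why, compare the column variances before and after standardising; scale() gives every variable unit variance, so no single variable dominates the distance calculations:

```r
apply(dat, 2, var)          # raw variances differ markedly (X1 is far larger)
apply(scale(dat), 2, var)   # all exactly 1 after standardising
```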
