Understanding kmeans clustering in R


Problem description

The code below (minus my questions) generates this graph:

I have marked 4 areas of confusion with "->".

> m <- matrix(c(1,1,1) , ncol=3)
> 
> x <- rbind(matrix(c(1,0,1) , ncol=3),
+            matrix(c(1,1,1) , ncol=3),
+            matrix(c(1,1,0) , ncol=3),
+            matrix(c(0,1,1) , ncol=3),
+            matrix(c(0,0,1) , ncol=3),
+            matrix(c(0,0,0) , ncol=3),
+            matrix(c(1,1,1) , ncol=3),
+            matrix(c(1,1,1) , ncol=3),
+            matrix(c(1,1,0) , ncol=3),
+            matrix(c(1,0,0) , ncol=3),
+            matrix(c(0,0,1) , ncol=3),
+            matrix(c(0,0,0) , ncol=3),
+            matrix(c(0,0,1) , ncol=3),
+            matrix(c(0,1,1) , ncol=3),
+            matrix(c(1,0,1) , ncol=3),
+            matrix(c(0,1,0) , ncol=3))
> colnames(x) <- c("google", "stackoverflow", "tester")
> (cl <- kmeans(x, 3))

K-means clustering with 3 clusters of sizes 3, 10, 3
-> Where do the sizes 3, 10, 3 come from?

Cluster means:
     google stackoverflow tester
1 0.6666667           1.0      0
2 0.5000000           0.5      1
3 0.3333333           0.0      0

-> There are three clusters, but what does each number signify?

Clustering vector:
 [1] 2 2 1 2 2 3 2 2 1 3 2 3 2 2 2 1

-> This looks like it is created by summing the values in each row, but that does not match: the second element of this vector is 2, while the second row of 'x' is matrix(c(1,1,1) , ncol=3), which sums to 3.

Within cluster sum of squares by cluster:
[1] 0.6666667 5.0000000 0.6666667
 (between_SS / total_SS =  46.1 %)

-> What are between_SS and total_SS?

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"        
> plot(x, col = cl$cluster)
> points(cl$centers, col = 1:5, pch = 8, cex = 2)
> 

Can anyone answer these questions? From reading the description of this algorithm (http://en.wikipedia.org/wiki/K-means_clustering), I fail to see how R is computing these values.

Recommended answer

1. What do the cluster sizes mean?

You provided 16 records and told kmeans to find 3 clusters. It partitioned those 16 records into 3 groups: A with 3 records, B with 10 records and C with 3 records.
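To make the sizes concrete, here is a small sketch that counts the labels by hand. It rebuilds the 16 records from the question as one matrix and, as an assumption made for reproducibility, copies the clustering vector from the printed output instead of re-running kmeans (kmeans picks random starting centers, so a fresh run may number or even form the clusters differently).

```r
# The 16 records from the question, one record per row
x <- matrix(c(1,0,1, 1,1,1, 1,1,0, 0,1,1,
              0,0,1, 0,0,0, 1,1,1, 1,1,1,
              1,1,0, 1,0,0, 0,0,1, 0,0,0,
              0,0,1, 0,1,1, 1,0,1, 0,1,0),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("google", "stackoverflow", "tester")))

# Clustering vector copied from the output shown in the question
cluster <- c(2, 2, 1, 2, 2, 3, 2, 2, 1, 3, 2, 3, 2, 2, 2, 1)

# Counting how many records carry each label gives the cluster sizes
sizes <- table(cluster)
print(sizes)  # label 1: 3 records, label 2: 10 records, label 3: 3 records
```

With a fitted object, the same counts should be available as cl$size.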

2. What do the cluster means signify?

These numbers give the location in N-dimensional space of the centroid (the "mean") of each cluster. You have three clusters, so you have three means. You have three dimensions ("google", "stackoverflow", "tester"), so each mean has a value in each dimension. Reading the numbers across a row gives the location of a single centroid.
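As a sketch of where those numbers come from, each row of the "Cluster means" table can be reproduced by averaging the columns over the records that carry that cluster's label. The clustering vector below is copied from the question's output (an assumption made so the result does not depend on kmeans' random start).

```r
# The 16 records from the question, one record per row
x <- matrix(c(1,0,1, 1,1,1, 1,1,0, 0,1,1,
              0,0,1, 0,0,0, 1,1,1, 1,1,1,
              1,1,0, 1,0,0, 0,0,1, 0,0,0,
              0,0,1, 0,1,1, 1,0,1, 0,1,0),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("google", "stackoverflow", "tester")))

# Clustering vector copied from the output shown in the question
cluster <- c(2, 2, 1, 2, 2, 3, 2, 2, 1, 3, 2, 3, 2, 2, 2, 1)

# One centroid per cluster: the column means of that cluster's records
centers <- t(sapply(1:3, function(k) {
  colMeans(x[cluster == k, , drop = FALSE])
}))
rownames(centers) <- 1:3
print(centers)
# Row 1 is (0.667, 1.0, 0), row 2 is (0.5, 0.5, 1), row 3 is (0.333, 0, 0),
# matching the "Cluster means" table in the question
```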

3. What is the clustering vector?

This is the cluster label the algorithm assigned to each record you passed in, in the same order as the rows of x. Remember how earlier I said there were 3 clusters of sizes 3, 10 and 3? These clusters are labeled 1, 2 and 3, and the algorithm stores the cluster label for each record in this vector. Here, you can see that there are 3 "1"s, 10 "2"s and 3 "3"s. Does that make sense?
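So the vector is not a row sum. One way to see this is to check that each record's label is simply the number of its nearest centroid. The sketch below does that by hand for the clustering printed in the question (the vector is copied from that output, which is an assumption made for reproducibility). It illustrates the result, not the internals of R's implementation.

```r
# The 16 records from the question, one record per row
x <- matrix(c(1,0,1, 1,1,1, 1,1,0, 0,1,1,
              0,0,1, 0,0,0, 1,1,1, 1,1,1,
              1,1,0, 1,0,0, 0,0,1, 0,0,0,
              0,0,1, 0,1,1, 1,0,1, 0,1,0),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("google", "stackoverflow", "tester")))

# Clustering vector copied from the output shown in the question
cluster <- c(2, 2, 1, 2, 2, 3, 2, 2, 1, 3, 2, 3, 2, 2, 2, 1)

# Centroids: per-cluster column sums divided by the cluster sizes
centers <- rowsum(x, cluster) / as.vector(table(cluster))

# Squared Euclidean distance from every record to every centroid (16 x 3)
d2 <- sapply(1:3, function(k) rowSums(sweep(x, 2, centers[k, ])^2))

# Label of the closest centroid for each record
nearest <- apply(d2, 1, which.min)
print(all(nearest == cluster))  # TRUE: every label is the nearest centroid

# The second record (1,1,1) is closest to centroid 2, hence the label 2
print(round(d2[2, ], 3))
```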

4. What are between_SS and total_SS?

SS stands for "sum of squares"; this is notation generally used in ANOVA. total_SS is the sum of squared distances of all records from the overall mean, and between_SS is the part of that total accounted for by the separation between the clusters. You might find this helpful: http://www-ist.massey.ac.nz/dstirlin/CAST/CAST/HrandBlock/randBlock7.html
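The 46.1 % in the output can be reproduced by hand. The sketch below assumes the usual decomposition, in which the between-cluster sum of squares is the total sum of squares minus the summed within-cluster sums of squares, and again copies the clustering vector from the question's output.

```r
# The 16 records from the question, one record per row
x <- matrix(c(1,0,1, 1,1,1, 1,1,0, 0,1,1,
              0,0,1, 0,0,0, 1,1,1, 1,1,1,
              1,1,0, 1,0,0, 0,0,1, 0,0,0,
              0,0,1, 0,1,1, 1,0,1, 0,1,0),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("google", "stackoverflow", "tester")))

# Clustering vector copied from the output shown in the question
cluster <- c(2, 2, 1, 2, 2, 3, 2, 2, 1, 3, 2, 3, 2, 2, 2, 1)

# Total SS: squared distances of all records from the overall mean
totss <- sum(scale(x, scale = FALSE)^2)

# Within SS: squared distances of each cluster's records from its own centroid
withinss <- sapply(1:3, function(k) {
  xk <- x[cluster == k, , drop = FALSE]
  sum(scale(xk, scale = FALSE)^2)
})

# Between SS: what is left of the total after removing the within-cluster part
betweenss <- totss - sum(withinss)

print(withinss)                            # 0.6666667 5.0000000 0.6666667
print(round(100 * betweenss / totss, 1))   # 46.1
```

A higher ratio means more of the overall spread is explained by which cluster a record belongs to.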
