Evaluating K-means accuracy

Question

I created a 3-dimensional random data set with 4 defined patterns/classes in MATLAB. I applied the K-means algorithm to the data to see how well K-means can classify my samples according to those 4 patterns/classes.

I need help with the following:

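  1. What function/code can I use to evaluate how well the K-means algorithm correctly identifies the classes of my samples? Assume I set K = 4 as shown in the figure below [image not available]:

  2. How can I automatically identify the number of classes (K), assuming the classes in my data are unknown?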


My aim is to evaluate the accuracy of K-means and how changes to the data (through pre-processing) affect the algorithm's ability to identify classes. Examples with MATLAB code would be helpful!

Recommended answer

One basic metric for measuring how "good" your clustering is compared with your known class labels is called purity. It is an external (supervised) evaluation measure: it relies on a labeling of the instances that comes from real-world data rather than from the clustering itself.

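The mathematical definition of purity is as follows:

$$\mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} \lvert \omega_k \cap c_j \rvert$$

where $\Omega = \{\omega_1, \dots, \omega_K\}$ is the set of clusters, $C = \{c_1, \dots, c_J\}$ is the set of classes, and $N$ is the total number of instances.

In words, quoting a professor at Stanford University: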

To compute purity, each cluster is assigned to the class which is most frequent in the cluster, and then the accuracy of this assignment is measured by counting the number of correctly assigned documents and dividing by N.

A simple example: suppose K-means with k = 2 produced a very naive clustering that looked like this:

Cluster1    Label
  1           A         
  5           B
  7           B
  3           B
  2           B

Cluster2    Label
  4           A
  6           A
  8           A
  9           B

In Cluster1 there are 4 instances of label B and 1 instance of label A, and Cluster2 has 3 instances of label A and 1 instance of label B. Here the total purity is taken to be the sum of the purities of the individual clusters, of which there are k = 2. The purity of Cluster1 is the count of its most frequent label divided by the total number of instances in Cluster1.

Therefore the purity of Cluster1 is:

4/5 = 0.80

The 4 comes from the fact that the most frequent label (B) occurs 4 times, and there are 5 instances in the cluster in total.

It follows that the purity of Cluster2 is:

3/4 = 0.75

The total purity, taken as the sum of the per-cluster purities, is 1.55. So what does this tell us? A cluster is considered "pure" if it has a purity of 1, since that indicates that all of the instances in that cluster have the same label. A high score means the clusters found by K-means agree well with your original labels. Under this summed form, the "best" purity score for an entire data set equals the number of clusters k, since that would imply that every cluster has an individual purity of 1. Note that the quoted definition instead divides the number of correctly assigned instances by N, which for this example gives (4 + 3)/9 ≈ 0.78 and has a maximum of 1.
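The worked example above can be checked with a few lines of code. The following is a minimal sketch in plain Python rather than MATLAB (an assumption of this example, not part of the original answer); the function names `cluster_purities` and `overall_purity` are hypothetical. It reports both the per-cluster purities used above and the normalized purity from the quoted definition.

```python
from collections import Counter

def _label_counts(cluster_ids, labels):
    """Group the true labels by cluster: {cluster_id: Counter of labels}."""
    counts = {}
    for cid, lab in zip(cluster_ids, labels):
        counts.setdefault(cid, Counter())[lab] += 1
    return counts

def cluster_purities(cluster_ids, labels):
    """Per-cluster purity: share of the most frequent label in each cluster."""
    counts = _label_counts(cluster_ids, labels)
    return {cid: max(c.values()) / sum(c.values()) for cid, c in counts.items()}

def overall_purity(cluster_ids, labels):
    """Normalized purity: majority-label counts summed over clusters, divided by N."""
    counts = _label_counts(cluster_ids, labels)
    return sum(max(c.values()) for c in counts.values()) / len(labels)

# The nine instances from the example: cluster assignment and true label.
cluster_ids = [1, 1, 1, 1, 1, 2, 2, 2, 2]
labels = ["A", "B", "B", "B", "B", "A", "A", "A", "B"]

per_cluster = cluster_purities(cluster_ids, labels)
print(per_cluster)                                  # {1: 0.8, 2: 0.75}
print(round(sum(per_cluster.values()), 2))          # 1.55 (summed form)
print(round(overall_purity(cluster_ids, labels), 3))  # 0.778 (divided by N)
```

The normalized value stays between 0 and 1 regardless of k, which makes it easier to compare runs with different numbers of clusters.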

However, you need to be aware that purity is not always the best or most telling metric. For example, if you had 10 points and chose k = 10, every cluster would have a purity of 1, giving a summed purity of 10, which equals k, even though such a clustering tells you nothing. In that case it would be better to use other external metrics such as precision, recall, and F-measure. I would suggest looking into those if you can. To reiterate, this is only useful when you have prior knowledge of a labeling, which I believe is the case in your question.
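The answer does not say which definition of precision, recall, and F-measure it has in mind. One common choice for clusterings is pair counting: every pair of instances is a "positive" decision if the two share a cluster, and it is correct if they also share a label. The sketch below assumes that definition; it is plain Python, and the function name `pair_counting_scores` is hypothetical.

```python
from itertools import combinations

def pair_counting_scores(cluster_ids, labels):
    """Return (precision, recall, F1) computed over all pairs of instances."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels)), 2):
        same_cluster = cluster_ids[i] == cluster_ids[j]
        same_label = labels[i] == labels[j]
        if same_cluster and same_label:
            tp += 1          # pair correctly placed together
        elif same_cluster:
            fp += 1          # pair placed together but labels differ
        elif same_label:
            fn += 1          # pair split apart despite sharing a label
    # By convention in this sketch, an undefined ratio is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# The k = 2 example from above.
cluster_ids = [1, 1, 1, 1, 1, 2, 2, 2, 2]
labels = ["A", "B", "B", "B", "B", "A", "A", "A", "B"]
print(pair_counting_scores(cluster_ids, labels))    # (0.5625, 0.5625, 0.5625)

# Degenerate case: 10 points, 10 singleton clusters. Purity is perfect,
# but no same-label pair is kept together, so recall and F1 drop to 0.
singletons = list(range(10))
ten_labels = ["A"] * 5 + ["B"] * 5
print(pair_counting_scores(singletons, ten_labels))  # (0.0, 0.0, 0.0)
```

Unlike purity, these scores penalize splitting a class across many small clusters, which is why they do not reward the k = 10 case.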

To answer your second question: choosing the number of clusters K is the most difficult part of K-means when you have no prior knowledge of the data. There are techniques to mitigate the problems caused by the initial choice of K and of the centroids. Probably the most common is an algorithm called k-means++, which addresses how the initial centroids are chosen rather than the value of K itself. I would suggest looking into that for further info.
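To make the k-means++ idea concrete, here is a minimal sketch of its seeding step in plain Python: the first centroid is drawn uniformly at random, and each further centroid is drawn with probability proportional to its squared distance from the nearest centroid chosen so far. The function names and the sample points are hypothetical, K still has to be supplied by the caller, and the sketch assumes k does not exceed the number of distinct points.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_init(points, k, seed=None):
    """Pick k initial centroids from `points` using k-means++ style seeding."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]            # first centroid: uniform draw
    while len(centers) < k:
        # Weight each point by its squared distance to the nearest centroid,
        # so far-away points are more likely to become the next centroid.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

# Hypothetical 3-dimensional sample points forming four loose groups.
points = [
    (0.0, 0.0, 0.0), (0.0, 1.0, 0.0),
    (10.0, 10.0, 10.0), (10.0, 11.0, 10.0),
    (20.0, 0.0, 5.0), (0.0, 20.0, 5.0),
]
initial_centers = kmeans_pp_init(points, k=4, seed=0)
print(initial_centers)
```

Points that already serve as centroids get weight zero, so they are not drawn again. The returned centroids would then be handed to the usual K-means iterations as the starting positions.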
