MATLAB k-means clustering evaluation

Question

How can I effectively evaluate the performance of the standard MATLAB k-means implementation?

For example, I have a matrix X:

X = [1  2;
     3  4;
     2  5;
     83 76;
     97 89] 

For every point I have a gold-standard clustering. Let's assume that (83,76) and (97,89) form the first cluster and (1,2), (3,4) and (2,5) form the second cluster. Then we run MATLAB:

idx = kmeans(X,2)

and get the following result:

idx = [1; 1; 2; 2; 2]

According to the nominal values this is a very bad clustering, because only (2,5) is labelled correctly. But we don't care about the nominal values; we only care about which points are clustered together. So somehow we have to identify that only (2,5) ends up in the wrong cluster.

For me, as a newbie in MATLAB, evaluating the performance of a clustering is not a trivial task. I would appreciate it if you could share your ideas about how to evaluate the performance.

Recommended answer

Evaluating the "best clustering" is somewhat ambiguous, especially if you have points in two different groups whose features overlap. In that case, how exactly do you define which cluster those points should be assigned to? Here's an example from the Fisher Iris dataset, which ships with MATLAB. Let's take the petal length and petal width, which are the third and fourth columns of the data matrix, and plot the versicolor and virginica classes (rows 51 to 100 and 101 to 150):

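load fisheriris;  % meas columns: sepal length, sepal width, petal length, petal width
% Rows 101:150 are virginica (blue); rows 51:100 are versicolor (red).
plot(meas(101:150,3), meas(101:150,4), 'b.', meas(51:100,3), meas(51:100,4), 'r.', 'MarkerSize', 24)
xlabel('Petal length (cm)'); ylabel('Petal width (cm)'); legend('virginica', 'versicolor');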

This is what we get:

[Figure: scatter plot of the two classes, one in blue and one in red]

You can see that there is some overlap towards the middle. You are lucky in that you knew what the clusters were beforehand, so you can measure the accuracy. But if we were given data like the above without knowing the label of each point, how would you know which cluster the middle points belong to?

Instead, what you should do is try to minimize these classification errors by running kmeans more than once. Specifically, you can override the default behaviour of kmeans like this:

idx = kmeans(X, 2, 'Replicates', num);

The 'Replicates' option tells kmeans to run num times in total, each time from a new set of initial cluster centres. The memberships it returns are those from the run the algorithm judged best over all of those runs. I won't go into the details, but "best" here means the run with the lowest total sum of point-to-centroid distances within the clusters.
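For intuition, the selection that 'Replicates' performs can be approximated by hand. The sketch below is not how kmeans is implemented internally; it simply calls kmeans several times and keeps the run whose third output, sumd (the within-cluster sums of point-to-centroid distances), has the smallest total. The names numRuns, bestTotal and bestIdx are made up for illustration.

```matlab
X = [1 2; 3 4; 2 5; 83 76; 97 89];
numRuns = 10;            % hypothetical name: how many independent runs to try
bestTotal = Inf;         % lowest total within-cluster distance seen so far
bestIdx = [];            % memberships from the best run so far

for r = 1:numRuns
    % Each call starts from its own random initial centroids.
    [idx, ~, sumd] = kmeans(X, 2);
    total = sum(sumd);   % one number describing how tight this run's clusters are
    if total < bestTotal
        bestTotal = total;
        bestIdx = idx;
    end
end

disp(bestIdx)
```

In practice the one-line 'Replicates' call is preferable; the loop only makes the selection criterion visible.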

If you don't set 'Replicates', kmeans runs once by default. So try increasing the total number of runs to raise the probability of getting higher-quality cluster memberships. With num = 10, this is what we get with your data:

X = [1  2;
     3  4;
     2  5;
     83 76;
     97 89];
num = 10;
idx = kmeans(X, 2, 'Replicates', num)

idx =

     2
     2
     2
     1
     1

You'll see that the first three points belong to one cluster while the last two points belong to the other. Even though the IDs are flipped relative to your gold standard, it doesn't matter: what we want is a clear separation between the groups.
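Since the cluster IDs are arbitrary, one way to score a result against a gold standard is to try every renaming of the IDs and keep the one that agrees best. The sketch below does this for the idx from the question; gold encodes the gold standard described there, and labelPerms, mapping, bestAcc and misassigned are illustrative names. Trying all permutations is only reasonable for a small number of clusters.

```matlab
% Gold standard from the question: (83,76) and (97,89) are cluster 1, the rest cluster 2.
gold = [2; 2; 2; 1; 1];
% Labels that kmeans returned in the question.
idx  = [1; 1; 2; 2; 2];

k = 2;
labelPerms = perms(1:k);        % every possible renaming of the k cluster IDs
bestAcc = 0;
bestLabels = idx;

for p = 1:size(labelPerms, 1)
    mapping = labelPerms(p, :); % mapping(j) is the new name given to cluster j
    relabeled = mapping(idx);
    relabeled = relabeled(:);   % force a column so it lines up with gold
    acc = mean(relabeled == gold);
    if acc > bestAcc
        bestAcc = acc;
        bestLabels = relabeled;
    end
end

% Rows of X that sit in the wrong cluster under the best renaming.
misassigned = find(bestLabels ~= gold);
fprintf('accuracy = %.2f, misassigned rows: %s\n', bestAcc, mat2str(misassigned));
```

For this idx the best renaming swaps the two IDs, giving an accuracy of 0.80 with row 3, the point (2,5), as the only misassigned one.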

If you take a look at the comments above, you'll notice that several people ran kmeans on your data and got different clustering results. The reason is that kmeans chooses the initial points for the cluster centres at random. So, depending on the state of each person's random number generator, the initial points chosen for one person are not guaranteed to be the same as those chosen for another.

Therefore, if you want reproducible results, you should seed the random number generator with the same value before running kmeans. To do that, call rng with an integer that is agreed on beforehand, like 123. If we do this before the code above, everyone who runs the code will be able to reproduce the same results.

Therefore:

rng(123);
X = [1  2;
     3  4;
     2  5;
     83 76;
     97 89]; 
num = 10;
idx = kmeans(X, 2, 'Replicates', num)

idx = 

    1
    1
    1
    2
    2

Here the labels are reversed compared with the previous run, but I guarantee that if anyone else runs the above code, they will get the same labelling as shown above each time.
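A quick way to convince yourself of this is to seed the generator, cluster, reseed with the same value, cluster again, and compare. This is a minimal check, assuming the Statistics and Machine Learning Toolbox is installed; idxA and idxB are illustrative names.

```matlab
X = [1 2; 3 4; 2 5; 83 76; 97 89];
num = 10;

rng(123);                                   % fix the generator state
idxA = kmeans(X, 2, 'Replicates', num);

rng(123);                                   % reset to the same state
idxB = kmeans(X, 2, 'Replicates', num);

% Same seed, same data and same options should give identical memberships.
disp(isequal(idxA, idxB))
```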
