How can I match up cluster labels to my 'ground truth' labels in Matlab

Question

I have searched here and googled, but to no avail. When clustering in Weka there is a handy option, classes to clusters, which matches up the clusters produced by the algorithm (e.g. simple k-means) to the 'ground truth' class labels you supply as the class attribute, so that we can see the cluster accuracy (% incorrect).

Now, how can I achieve this in Matlab, i.e. translate my clusterClasses vector, e.g. [1, 1, 2, 1, 3, 2, 3, 1, 1, 1], into the same labelling as the supplied ground truth labels vector, e.g. [2, 2, 2, 3, 1, 3]?

I think it is probably based on cluster centres and label centres, but I'm not sure how to implement!

Any help would be greatly appreciated.

Vincent

Recommended answer

I stumbled on a similar problem a couple of months ago while doing clustering. I did not search for built-in solutions for very long (although I am sure they must exist) and ended up writing my own little script for matching my found labels as closely as possible to the ground truth. The code is very crude, but it should get you started.

It is based on trying all possible rearrangements of the labels to see which best fits the truth vector. That means that given a clustering result yte = [3 3 2 1] with ground truth y = [1 1 2 3], the script will try to match [3 3 2 1], [3 3 1 2], [2 2 3 1], [2 2 1 3], [1 1 2 3] and [1 1 3 2] with y to find the best match.
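To make the enumeration concrete, here is a minimal sketch of the same idea on the four-element example from the paragraph above. It assumes the cluster labels are the integers 1..K so that they can be used directly as column indices; the variable names (`P`, `candidate`, `bestAcc`) are illustrative and are not part of the answer's function.

```matlab
% Sketch: enumerate every relabeling of a clustering result with perms().
% Assumption: cluster labels are the integers 1..K, so they can index columns.
yte = [3 3 2 1];          % clustering result from the text
y   = [1 1 2 3];          % ground truth from the text

K = numel(unique(yte));
P = perms(1:K);           % K! rows, one candidate mapping per row

bestAcc = 0;
for i = 1:size(P, 1)
    candidate = P(i, yte);             % cluster k is renamed to P(i,k)
    acc = mean(candidate == y);        % fraction of samples that agree with y
    fprintf('%s -> accuracy %.2f\n', mat2str(candidate), acc);
    bestAcc = max(bestAcc, acc);
end
```

The loop prints the six candidate vectors listed above, and one of them, [1 1 2 3], agrees with y on every sample. Because `P` has K! rows, the loop length grows factorially with the number of clusters.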

This relies on the built-in function perms(), which becomes impractical for more than about 10 unique clusters. The code also tends to be slow for 7-10 unique clusters, because the number of permutations grows factorially.

function [accuracy, true_labels, CM] = calculateAccuracy(yte, y)
%# Function for calculating clustering accuracy and matching found 
%# labels with true labels. Assumes yte and y both are Nx1 vectors with
%# clustering labels, containing the same number of distinct labels.
%# Does not support fuzzy clustering.
%#
%# Algorithm is based on trying out all reorderings of cluster labels, 
%# e.g. if yte = [1 2 2], try [1 2 2] and [2 1 1] to see which fits 
%# the truth vector best. Since this approach makes use of perms(),
%# it is impractical for more than about 10 unique labels, and it slows
%# down significantly once the number of clusters exceeds 7.
%#
%# Input:
%#   yte - result from clustering (y-test)
%#   y   - truth vector
%#
%# Output:
%#   accuracy    -   Overall accuracy for entire clustering (OA). For
%#                   overall error, use OE = 1 - OA.
%#   true_labels -   Vector giving the label rearrangement which best 
%#                   matches the truth vector (y).
%#   CM          -   Confusion matrix. If there are 4 unique labels,
%#                   produce a 4x4 matrix counting the correct and the
%#                   incorrect assignments (rows: truth, columns: matched).

N = length(y);

cluster_names = unique(yte);
accuracy = 0;
maxInd = 1;

perm = perms(unique(y));
[pN pM] = size(perm);

true_labels = y;

for i=1:pN
    flipped_labels = zeros(1,N);
    for cl = 1 : pM
        flipped_labels(yte==cluster_names(cl)) = perm(i,cl);
    end

    testAcc = sum(flipped_labels == y')/N;
    if testAcc > accuracy
        accuracy = testAcc;
        maxInd = i;
        true_labels = flipped_labels;
    end

end

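class_names = unique(y);   %# row/column order of the confusion matrix
CM = zeros(pM,pM);
for rc = 1 : pM
    for cc = 1 : pM
        %# rows: true class, columns: matched cluster label
        CM(rc,cc) = sum( (y'==class_names(rc)) & (true_labels==class_names(cc)) );
    end
end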

Example:

[acc newLabels CM] = calculateAccuracy([3 2 2 1 2 3]',[1 2 2 3 3 3]')

acc =

0.6667


newLabels =

 1     2     2     3     2     1


CM =

 1     0     0
 0     2     0
 1     1     1
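For more clusters than perms() can comfortably handle, one possible alternative is to treat the matching as an assignment problem. The sketch below is not part of the answer's function: it assumes the built-in `matchpairs` (base MATLAB, R2019a or later) is available, and the variable names (`overlap`, `cost`, `mapped`) are illustrative. It builds a cluster-by-class overlap table and picks the one-to-one mapping with the largest total overlap.

```matlab
% Sketch: one-to-one label matching as an assignment problem instead of perms().
% Assumption: matchpairs (base MATLAB, R2019a or later) is available.
yte = [3 2 2 1 2 3]';   % cluster labels (same data as the example call above)
y   = [1 2 2 3 3 3]';   % ground-truth labels

[clusterNames, ~, ic] = unique(yte(:));  % ic: cluster index of each sample
[classNames,   ~, iy] = unique(y(:));    % iy: class index of each sample

% Contingency table: overlap(i,j) = samples in cluster i with true class j
overlap = accumarray([ic iy], 1, [numel(clusterNames) numel(classNames)]);

% matchpairs minimises total cost, so convert overlaps to non-negative costs
cost = max(overlap(:)) - overlap;
M = matchpairs(cost, 1e6);   % large unmatched cost so every pairing is made

% Relabel: cluster M(k,1) takes class M(k,2); unmatched clusters stay NaN
mapped = nan(numel(clusterNames), 1);
mapped(M(:,1)) = classNames(M(:,2));
newLabels = mapped(ic);
accuracy = mean(newLabels == y(:));
```

On this input the best mapping agrees with the ground truth on 4 of the 6 samples, the same accuracy as the brute-force search, but the cost of the search no longer grows factorially with the number of clusters.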
