K-means: finding the elbow when the elbow plot is a smooth curve

Question

I am trying to plot the elbow curve for k-means using the code below:

load CSDmat %mydata
for k = 2:20
    opts = statset('MaxIter', 500, 'Display', 'off');
    [IDX1,C1,sumd1,D1] = kmeans(CSDmat,k,'Replicates',5,'options',opts,'distance','correlation');% kmeans matlab
    [yy,ii] = min(D1');      %% assign points to nearest center

    distort = 0;
    distort_across = 0;
    clear clusts;
    for nn=1:k
        I = find(ii==nn);       %% indices of points in cluster nn
        J = find(ii~=nn);       %% indices of points not in cluster nn
        clusts{nn} = I;         %% save into clusts cell array
        if (length(I)>0)
            mu(nn,:) = mean(CSDmat(I,:));               %% update mean
            %% Compute within class distortion
            muB = repmat(mu(nn,:),length(I),1);
            distort = distort+sum(sum((CSDmat(I,:)-muB).^2));
            %% Compute across class distortion
            muB = repmat(mu(nn,:),length(J),1);
            distort_across = distort_across + sum(sum((CSDmat(J,:)-muB).^2));
        end
    end
    %% Set distortion as the ratio between the within
    %% class scatter and the across class scatter
    distort = distort/(distort_across+eps);

    bestD(k) = distort;
    bestC = clusts;
end
figure; plot(2:20, bestD(2:20));   % bestD(1) is never set, so plot only k = 2..20

The values of bestD (within-cluster variance / between-cluster variance) are:

[
0.401970132754914
0.193697163350293
0.119427184084282
0.0872681777446508
0.0687948264457301
0.0566215549396577
0.0481117619129058
0.0420491551659459
0.0361696583755145
0.0320384092689509
0.0288948343304147
0.0262373245283877
0.0239462330460614
0.0218350896369853
0.0201506779033703
0.0186757121130685
0.0176258625858971
0.0163239661159014
0.0154933431470081
]

The code is adapted from code by Lihi Zelnik-Manor (Caltech, March 2005).

The ratio of within-cluster variance to between-cluster variance, plotted from the bestD data above, is a smooth curve, and its knee is a gentle bend rather than a sharp corner. How do we find the knee in a plot like this?

Recommended answer

I think it is better to use only your "within class distortion" as the optimization parameter:

%% Compute within class distortion
muB = repmat(mu(nn,:),length(I),1);
distort = distort+sum(sum((CSDmat(I,:)-muB).^2));

Use this value as it is, without dividing it by "distort_across". If you calculate the "derivative" of this:

unexplained_error = within_class_distortion;
derivative = diff(unexplained_error);
plot(derivative)
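
The snippet above does not show how within_class_distortion is filled in. A minimal sketch of one way to collect it, one value per k, is given here. It makes several assumptions that are not in the original post: the data X is synthetic (three made-up blobs, not CSDmat), kmeans is run with its default squared Euclidean distance so that sum(sumd) is the total within-cluster sum of squares, and k starts at 1 so that the vector index equals k.

```matlab
% Sketch only: synthetic data stands in for CSDmat (assumption).
rng(0);                                        % reproducible example
X = [randn(50,2); ...
     randn(50,2) + repmat([5 5], 50, 1); ...
     randn(50,2) + repmat([10 0], 50, 1)];     % three made-up blobs

ks = 1:10;                                     % index n corresponds to k = n
within_class_distortion = zeros(size(ks));
opts = statset('MaxIter', 500, 'Display', 'off');

for n = 1:numel(ks)
    % sumd holds, per cluster, the sum of point-to-centroid distances
    % (squared Euclidean here, because that is the kmeans default).
    [~, ~, sumd] = kmeans(X, ks(n), 'Replicates', 5, 'Options', opts);
    within_class_distortion(n) = sum(sumd);
end

figure; plot(ks, within_class_distortion, '-o');
xlabel('k'); ylabel('within-class distortion');
```

If kmeans is called with 'distance','correlation' as in the question, sumd is measured in correlation distance instead, so the curve would track that objective rather than the squared Euclidean distortion computed by hand in the question's loop.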

derivative(k) tells you how much the unexplained error decreased when one more cluster was added. I suggest that you stop adding clusters when the decrease in this error becomes smaller than one tenth of the first decrease you obtained.

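% derivative is negative while the error is decreasing, so compare magnitudes
for i = 1:length(derivative)
    if abs(derivative(i)) < abs(derivative(1))/10
        break
    end
end
k_opt = i+1;   % assumes unexplained_error(1) corresponds to k = 1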
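
To make the stopping rule concrete, here is a small worked sketch on made-up numbers. The error values are hypothetical and are not taken from the question. Because diff returns negative values while the error is falling, the sketch compares magnitudes. The fallback for the case where no decrease is small enough is an assumption, chosen to match what the loop above yields when it never breaks.

```matlab
% Hypothetical error values for k = 1..8 (made up for illustration only).
unexplained_error = [1000 400 150 120 100 90 85 82];

derivative = diff(unexplained_error);       % change in error per added cluster
threshold  = abs(derivative(1)) / 10;       % one tenth of the first decrease

% Index of the first decrease whose magnitude falls below the threshold.
i = find(abs(derivative) < threshold, 1);

if isempty(i)
    % Assumption: the curve never flattened, so keep the largest k tried.
    k_opt = numel(unexplained_error);
else
    k_opt = i + 1;                          % same index-to-k mapping as the loop
end

fprintf('k_opt = %d\n', k_opt);             % prints k_opt = 4 for these numbers
```

Here the decreases are 600, 250, 30, 20, ..., so the threshold is 60 and the first small decrease is the third one. Whether to report i or i+1 at that point is a judgment call. The answer's loop uses i+1, and the sketch follows it.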

In fact, the best way to choose the number of clusters depends on the application, but I think this suggestion will give you a good value of k.
