How Could One Implement the K-Means++ Algorithm?


Question

I am having trouble fully understanding the K-Means++ algorithm. I am interested in exactly how the first k centroids are picked (namely the initialization), as the rest is like the original K-Means algorithm.

  1. Is the probability function used based on distance or Gaussian?
  2. At the same time, the point with the longest distance from the other centroids is picked as the new centroid.

I would appreciate a step-by-step explanation and an example; the one in Wikipedia is not clear enough. Very well-commented source code would also help. If you are using 6 arrays, please tell us which one is for what.

Recommended Answer

Interesting question. Thank you for bringing this paper to my attention - K-Means++: The Advantages of Careful Seeding

In simple terms, cluster centers are initially chosen at random from the set of input observation vectors, where the probability of choosing vector x is high if x is not near any previously chosen centers.

Here is a one-dimensional example. Our observations are [0, 1, 2, 3, 4]. Let the first center, c1, be 0. The probability that the next cluster center, c2, is x is proportional to ||c1-x||^2. So, P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, where a = 1/(1+4+9+16).

Suppose c2=4. Then, P(c3 = 1) = 1a, P(c3 = 2) = 4a, P(c3 = 3) = 1a, where a = 1/(1+4+1).
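
To make the arithmetic concrete, here is a minimal sketch (assuming numpy; the variable names are mine) that computes these squared-distance-weighted probabilities for the observations [0, 1, 2, 3, 4] with c1 = 0 and c2 = 4 already chosen:

import numpy as np

X = np.array([0, 1, 2, 3, 4])      # the observations from the example
centers = [0, 4]                   # c1 = 0 and c2 = 4
# squared distance from each point to its nearest chosen center
D2 = np.array([min((x - c) ** 2 for c in centers) for x in X])
probs = D2 / D2.sum()              # D^2 weighting
print(probs)                       # [0. 1/6 4/6 1/6 0.], matching the probabilities above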

I've coded the initialization procedure in Python; I don't know if this helps you.

import numpy as np

def initialize(X, K):
    """Choose K initial centers from the observations X using D^2 weighting (k-means++ seeding)."""
    C = [X[0]]                      # take the first observation as the first center
    for k in range(1, K):
        # squared distance from each point to its nearest already-chosen center
        D2 = np.array([min([np.inner(c - x, c - x) for c in C]) for x in X])
        probs = D2 / D2.sum()       # selection probabilities, proportional to D^2
        cumprobs = probs.cumsum()   # cumulative probabilities partition [0, 1]
        r = np.random.rand()        # uniform draw in [0, 1)
        # find which partition r falls into; that index is the next center
        for j, p in enumerate(cumprobs):
            if r < p:
                i = j
                break
        C.append(X[i])
    return C
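
As a quick usage sketch (my own illustrative call, reusing the one-dimensional observations from the example; the reshape just makes each observation a 1-element vector so the inner products work):

X = np.array([0, 1, 2, 3, 4], dtype=float).reshape(-1, 1)  # one observation per row
C = initialize(X, K=3)   # pick 3 initial centers with D^2 weighting
print(C)                 # e.g. [array([0.]), array([4.]), array([2.])] -- later picks are random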

EDIT with clarification: The output of cumsum gives us boundaries that partition the interval [0, 1]. These partitions have length equal to the probability of the corresponding point being chosen as a center. So, since r is chosen uniformly in [0, 1], it falls into exactly one of these intervals (the break ensures we stop at the first one). The for loop checks which partition r is in.

Example:

probs = [0.1, 0.2, 0.3, 0.4]
cumprobs = [0.1, 0.3, 0.6, 1.0]
if r < cumprobs[0]:
    # this event has probability 0.1
    i = 0
elif r < cumprobs[1]:
    # this event has probability 0.2
    i = 1
elif r < cumprobs[2]:
    # this event has probability 0.3
    i = 2
elif r < cumprobs[3]:
    # this event has probability 0.4
    i = 3
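
As a side note (not something the answer's code does), the if/elif chain above is just inverse-transform sampling over cumprobs, and the same index can be obtained with numpy's searchsorted. A minimal sketch reusing the example's numbers:

import numpy as np

cumprobs = np.array([0.1, 0.3, 0.6, 1.0])
r = 0.45                                             # any uniform draw in [0, 1)
i = int(np.searchsorted(cumprobs, r, side='right'))  # first index with cumprobs[i] > r
print(i)                                             # 2 -- the same index the chain above selects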
