How exactly does k-means++ work?

Problem description

I am having trouble fully understanding the k-means++ algorithm. I am interested exactly how the first k centroids are picked (the rest is like in the original k-means).

  1. Is the probability function used based on distance, or is it Gaussian?
  2. At the same time, is the point farthest away (from the other centroids) picked as a new centroid?

I will appreciate a step by step explanation and an example. The one in Wikipedia is not clear enough. Also a very well commented source code would also help. If you are using 6 arrays then please tell us which one is for what.

Answer

Interesting question. Thank you for bringing this paper to my attention. A PDF of the original paper is linked here.

In simple terms, cluster centers are initially chosen at random from the set of input observation vectors, where the probability of choosing vector x is high if x is not near any previously chosen centers.

Here is a one-dimensional example. Our observations are [0, 1, 2, 3, 4]. Let the first center, c1, be 0. The probability that the next cluster center, c2, is x is proportional to ||c1-x||^2. So, P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, where a = 1/(1+4+9+16).

Suppose c2=4. Then, P(c3 = 1) = 1a, P(c3 = 2) = 4a, P(c3 = 3) = 1a, where a = 1/(1+4+1).
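The two sampling steps above can be checked numerically. The following is a small sketch using NumPy (the array library is my choice here; the variable names are not from the answer):

```python
import numpy as np

X = np.array([0, 1, 2, 3, 4])

# Step 1: only c1 = 0 has been chosen so far.
D2 = (X - 0) ** 2              # squared distance to the nearest center: [0, 1, 4, 9, 16]
probs = D2 / D2.sum()          # [0, 1/30, 4/30, 9/30, 16/30], so a = 1/30

# Step 2: suppose c2 = 4 was drawn; distances are now to the nearest of {0, 4}.
D2 = np.minimum((X - 0) ** 2, (X - 4) ** 2)   # [0, 1, 4, 1, 0]
probs2 = D2 / D2.sum()         # [0, 1/6, 4/6, 1/6, 0], so a = 1/6
```

Note that the already-chosen centers 0 and 4 end up with probability zero, so a center is never picked twice.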

I've coded the initialization procedure in Python; I don't know if this helps you.

import numpy as np

def initialize(X, K):
    """Pick K initial centers from X using k-means++ seeding."""
    C = [X[0]]
    for k in range(1, K):
        # Squared distance from each point to its nearest already-chosen center.
        D2 = np.array([min(np.inner(c - x, c - x) for c in C) for x in X])
        probs = D2 / D2.sum()
        cumprobs = probs.cumsum()
        r = np.random.rand()
        # Find the partition of [0, 1] that r falls into.
        for j, p in enumerate(cumprobs):
            if r < p:
                i = j
                break
        C.append(X[i])
    return C

EDIT with clarification: The output of cumsum gives us boundaries that partition the interval [0, 1]. Each partition has length equal to the probability of the corresponding point being chosen as a center. Since r is chosen uniformly in [0, 1), it falls into exactly one of these intervals; the for loop checks which partition r is in, and break stops at the first match.

例如:

probs = [0.1, 0.2, 0.3, 0.4]
cumprobs = [0.1, 0.3, 0.6, 1.0]
if r < cumprobs[0]:
    # this event has probability 0.1
    i = 0
elif r < cumprobs[1]:
    # this event has probability 0.2
    i = 1
elif r < cumprobs[2]:
    # this event has probability 0.3
    i = 2
elif r < cumprobs[3]:
    # this event has probability 0.4
    i = 3
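The partition lookup that the for/break loop performs can also be done in a single call with NumPy's searchsorted, which finds the interval r falls into directly. Below is a sketch of the whole seeding routine rewritten that way; it assumes X is a 2-D array of points as rows, and the function and variable names are mine, not from the answer:

```python
import numpy as np

def initialize_pp(X, K, rng=None):
    """k-means++ seeding: pick K initial centers from the rows of X (shape (n, d))."""
    rng = np.random.default_rng(rng)
    centers = [X[0]]
    for _ in range(1, K):
        # Squared distance of every point to its nearest already-chosen center.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        cumprobs = (d2 / d2.sum()).cumsum()
        # side='right' matches the strict "r < p" test in the explicit loop.
        i = np.searchsorted(cumprobs, rng.random(), side='right')
        centers.append(X[i])
    return np.array(centers)

# Usage on the one-dimensional example from the answer, as a column vector:
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
C = initialize_pp(X, 3, rng=0)
```

Because already-chosen centers have squared distance zero, their cumulative-probability segment has zero width and searchsorted can never land on them, so the K returned centers are distinct.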
