How Could One Implement the K-Means++ Algorithm?
Question
I am having trouble fully understanding the K-Means++ algorithm. I am interested in exactly how the first k centroids are picked, namely the initialization, as the rest is like in the original K-Means algorithm.
- Is the probability function used based on distance or Gaussian?
- At the same time, is the most distant point (from the other centroids) picked as a new centroid?
I would appreciate a step-by-step explanation and an example. The one in Wikipedia is not clear enough. A very well commented source code would also help. If you are using 6 arrays then please tell us which one is for what.
Answer
Interesting question. Thank you for bringing this paper to my attention - K-Means++: The Advantages of Careful Seeding
In simple terms, cluster centers are initially chosen at random from the set of input observation vectors, where the probability of choosing vector x is high if x is not near any previously chosen centers.
Here is a one-dimensional example. Our observations are [0, 1, 2, 3, 4]. Let the first center, c1, be 0. The probability that the next cluster center, c2, is x is proportional to ||c1 - x||^2. So, P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, where a = 1/(1+4+9+16).
Suppose c2=4. Then, P(c3 = 1) = 1a, P(c3 = 2) = 4a, P(c3 = 3) = 1a, where a = 1/(1+4+1).
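The two seeding steps above can be checked numerically. The helper below is an illustrative sketch (the function name is mine, not from the original answer):

```python
import numpy as np

X = np.array([0, 1, 2, 3, 4])  # the 1-D observations

def seeding_probs(X, centers):
    # squared distance from each point to its nearest chosen center
    D2 = np.array([min((x - c) ** 2 for c in centers) for x in X])
    return D2 / D2.sum()  # normalize so that P(x) is proportional to D(x)^2

p1 = seeding_probs(X, [0])     # proportional to [0, 1, 4, 9, 16], a = 1/30
p2 = seeding_probs(X, [0, 4])  # proportional to [0, 1, 4, 1, 0],  a = 1/6
```

Note that points coinciding with an existing center get probability 0, so a center is never picked twice.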
I've coded the initialization procedure in Python; I don't know if this helps you.
import numpy as np  # NumPy replaces the scipy.array/inner/rand aliases removed from modern SciPy

def initialize(X, K):
    """k-means++ seeding: pick K initial centers from the rows of X."""
    C = [X[0]]                        # first center: the first observation
    for k in range(1, K):
        # squared distance from each point to its nearest chosen center
        D2 = np.array([min(np.inner(c - x, c - x) for c in C) for x in X])
        probs = D2 / D2.sum()         # P(x) proportional to D(x)^2
        cumprobs = probs.cumsum()
        r = np.random.rand()          # uniform draw in [0, 1)
        for j, p in enumerate(cumprobs):
            if r < p:                 # r falls in the j-th interval
                i = j
                break
        C.append(X[i])
    return C
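The same seeding can also be written with vectorized distance computation; this is a sketch equivalent to the loop above, not part of the original answer (the function name and test data are mine):

```python
import numpy as np

def initialize_vec(X, K, rng=np.random.default_rng(0)):
    """k-means++ seeding with vectorized distances and np.random.choice-style sampling."""
    X = np.asarray(X, dtype=float)
    C = [X[0]]
    for _ in range(1, K):
        # squared distance of every point to its nearest chosen center
        diffs = X[:, None, :] - np.array(C)[None, :, :]
        D2 = np.min((diffs ** 2).sum(-1), axis=1)
        probs = D2 / D2.sum()
        # draw the next center index with probability proportional to D^2
        i = rng.choice(len(X), p=probs)
        C.append(X[i])
    return np.array(C)

# two tight, well-separated 2-D blobs: seeding must pick one center per blob
X = np.vstack([np.zeros((5, 2)), np.full((5, 2), 10.0)])
centers = initialize_vec(X, 2)
```

Because every point in the first blob sits exactly on the first center here, the second draw is forced into the other blob, which makes the behavior easy to verify.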
EDIT with clarification: The output of cumsum gives us boundaries to partition the interval [0, 1]. These partitions have length equal to the probability of the corresponding point being chosen as a center. So then, since r is uniformly chosen between [0, 1], it will fall into exactly one of these intervals (because of the break). The for loop checks to see which partition r is in.
Example:
probs = [0.1, 0.2, 0.3, 0.4]
cumprobs = [0.1, 0.3, 0.6, 1.0]
if r < cumprobs[0]:
    # this event has probability 0.1
    i = 0
elif r < cumprobs[1]:
    # this event has probability 0.2
    i = 1
elif r < cumprobs[2]:
    # this event has probability 0.3
    i = 2
elif r < cumprobs[3]:
    # this event has probability 0.4
    i = 3
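This interval lookup can be done without the chain of elifs; numpy.searchsorted finds the partition index directly (a sketch, not part of the original answer):

```python
import numpy as np

cumprobs = np.array([0.1, 0.3, 0.6, 1.0])

# side='right' returns the first index j with r < cumprobs[j],
# exactly matching the `if r < p` loop in the answer
for r in (0.05, 0.25, 0.95):
    i = int(np.searchsorted(cumprobs, r, side='right'))
    # 0.05 -> interval 0, 0.25 -> interval 1, 0.95 -> interval 3
```

This runs in O(log n) per draw instead of the loop's O(n), which matters when seeding over many observations.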