How Could One Implement the K-Means++ Algorithm?
Question
I am having trouble fully understanding the K-Means++ algorithm. I am interested in exactly how the first k centroids are picked, namely the initialization, as the rest is like in the original K-Means algorithm.
- Is the probability function used based on distance or Gaussian?
- At the same time, is the most distant point (from the other centroids) picked as a new centroid?
I would appreciate a step-by-step explanation and an example. The one in Wikipedia is not clear enough. A very well commented source code would also help. If you are using 6 arrays then please tell us which one is for what.
Answer
Interesting question. Thank you for bringing this paper to my attention - K-Means++: The Advantages of Careful Seeding
In simple terms, cluster centers are initially chosen at random from the set of input observation vectors, where the probability of choosing vector x is high if x is not near any previously chosen centers.
Here is a one-dimensional example. Our observations are [0, 1, 2, 3, 4]. Let the first center, c1, be 0. The probability that the next cluster center, c2, is x is proportional to ||c1 - x||^2. So, P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, where a = 1/(1+4+9+16).
Suppose c2=4. Then, P(c3 = 1) = 1a, P(c3 = 2) = 4a, P(c3 = 3) = 1a, where a = 1/(1+4+1).
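The two seeding steps above can be checked numerically. The helper below is an illustrative sketch (the function name is mine, not from the original answer):

```python
import numpy as np

X = np.array([0, 1, 2, 3, 4])  # the 1-D observations

def seeding_probs(X, centers):
    # squared distance from each point to its nearest chosen center
    D2 = np.array([min((x - c) ** 2 for c in centers) for x in X])
    return D2 / D2.sum()  # normalize so that P(x) is proportional to D(x)^2

p1 = seeding_probs(X, [0])     # proportional to [0, 1, 4, 9, 16], a = 1/30
p2 = seeding_probs(X, [0, 4])  # proportional to [0, 1, 4, 1, 0],  a = 1/6
```

Note that points coinciding with an existing center get probability 0, so a center is never picked twice.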
I've coded the initialization procedure in Python; I don't know if this helps you.
import numpy as np  # NumPy replaces the scipy.array/inner/rand aliases removed from modern SciPy

def initialize(X, K):
    """k-means++ seeding: pick K initial centers from the rows of X."""
    C = [X[0]]                        # first center: the first observation
    for k in range(1, K):
        # squared distance from each point to its nearest chosen center
        D2 = np.array([min(np.inner(c - x, c - x) for c in C) for x in X])
        probs = D2 / D2.sum()         # P(x) proportional to D(x)^2
        cumprobs = probs.cumsum()
        r = np.random.rand()          # uniform draw in [0, 1)
        for j, p in enumerate(cumprobs):
            if r < p:                 # r falls in the j-th interval
                i = j
                break
        C.append(X[i])
    return C
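The same seeding can also be written with vectorized distance computation; this is a sketch equivalent to the loop above, not part of the original answer (the function name and test data are mine):

```python
import numpy as np

def initialize_vec(X, K, rng=np.random.default_rng(0)):
    """k-means++ seeding with vectorized distances and np.random.choice-style sampling."""
    X = np.asarray(X, dtype=float)
    C = [X[0]]
    for _ in range(1, K):
        # squared distance of every point to its nearest chosen center
        diffs = X[:, None, :] - np.array(C)[None, :, :]
        D2 = np.min((diffs ** 2).sum(-1), axis=1)
        probs = D2 / D2.sum()
        # draw the next center index with probability proportional to D^2
        i = rng.choice(len(X), p=probs)
        C.append(X[i])
    return np.array(C)

# two tight, well-separated 2-D blobs: seeding must pick one center per blob
X = np.vstack([np.zeros((5, 2)), np.full((5, 2), 10.0)])
centers = initialize_vec(X, 2)
```

Because every point in the first blob sits exactly on the first center here, the second draw is forced into the other blob, which makes the behavior easy to verify.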
EDIT with clarification: The output of cumsum gives us boundaries to partition the interval [0, 1]. These partitions have length equal to the probability of the corresponding point being chosen as a center. So then, since r is uniformly chosen between [0, 1], it will fall into exactly one of these intervals (because of the break). The for loop checks to see which partition r is in.
Example:
probs = [0.1, 0.2, 0.3, 0.4]
cumprobs = [0.1, 0.3, 0.6, 1.0]
if r < cumprobs[0]:
    # this event has probability 0.1
    i = 0
elif r < cumprobs[1]:
    # this event has probability 0.2
    i = 1
elif r < cumprobs[2]:
    # this event has probability 0.3
    i = 2
elif r < cumprobs[3]:
    # this event has probability 0.4
    i = 3
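This interval lookup can be done without the chain of elifs; numpy.searchsorted finds the partition index directly (a sketch, not part of the original answer):

```python
import numpy as np

cumprobs = np.array([0.1, 0.3, 0.6, 1.0])

# side='right' returns the first index j with r < cumprobs[j],
# exactly matching the `if r < p` loop in the answer
for r in (0.05, 0.25, 0.95):
    i = int(np.searchsorted(cumprobs, r, side='right'))
    # 0.05 -> interval 0, 0.25 -> interval 1, 0.95 -> interval 3
```

This runs in O(log n) per draw instead of the loop's O(n), which matters when seeding over many observations.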