我的Davies-Bouldin Index的python实现正确吗? [英] Is my python implementation of the Davies-Bouldin Index correct?

查看:79
本文介绍了我的Davies-Bouldin Index的python实现正确吗?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试在其中计算戴维斯-布尔丁索引 Python.

I'm trying to calculate the Davies-Bouldin Index in Python.

以下是下面的代码尝试重现的步骤.

Here are the steps the code below tries to reproduce.

5个步骤:

  1. 对于每个聚类,计算每个点到质心之间的欧式距离
  2. 对于每个聚类,计算这些距离的平均值
  3. 对于每对集群,计算其质心之间的欧式距离

然后

  1. 对于每对集群,求出到它们各自质心的平均距离之和(在步骤2中计算),然后除以将它们分开的距离(在步骤3中计算).

最后,

  1. 计算所有这些除法的平均值(=所有索引)以获得整个聚类的Davies-Bouldin索引

代码

def daviesbouldin(X, labels, centroids):

    import numpy as np
    from scipy.spatial.distance import pdist, euclidean

    nbre_of_clusters = len(centroids) #Get the number of clusters
    distances = [[] for e in range(nbre_of_clusters)] #Store intra-cluster distances by cluster
    distances_means = [] #Store the mean of these distances
    DB_indexes = [] #Store Davies_Boulin index of each pair of cluster
    second_cluster_idx = [] #Store index of the second cluster of each pair
    first_cluster_idx = 0 #Set index of first cluster of each pair to 0

    # Step 1: Compute euclidean distances between each point of a cluster to their centroid
    for cluster in range(nbre_of_clusters):
        for point in range(X[labels == cluster].shape[0]):
            distances[cluster].append(euclidean(X[labels == cluster][point], centroids[cluster]))

    # Step 2: Compute the mean of these distances
    for e in distances:
        distances_means.append(np.mean(e))

    # Step 3: Compute euclidean distances between each pair of centroid
    ctrds_distance = pdist(centroids) 

    # Tricky step 4: Compute Davies-Bouldin index of each pair of cluster   
    for i, e in enumerate(e for start in range(1, nbre_of_clusters) for e in range(start, nbre_of_clusters)):
        second_cluster_idx.append(e)
        if second_cluster_idx[i-1] == nbre_of_clusters - 1:
            first_cluster_idx += 1
        DB_indexes.append((distances_means[first_cluster_idx] + distances_means[e]) / ctrds_distance[i])

    # Step 5: Compute the mean of all DB_indexes   
    print("DAVIES-BOULDIN Index: %.5f" % np.mean(DB_indexes)) 

在参数中:

  • X是数据
  • labels是通过聚类算法(即kmeans)计算出的标签
  • centroids是每个星团质心的坐标(即:cluster_centers_)
  • X is the data
  • labels, are the labels computed by a clustering algorithm (i.e: kmeans)
  • centroids are the coordinates of each cluster's centroid (i.e: cluster_centers_)

另外,请注意,我正在使用Python 3

Also, note that I'm using Python 3

问题1 :计算每对质心之间的欧几里得距离是否正确(步骤3)?

QUESTION1: Is the computation of euclidean distances between each pair of centroid correct (step 3)?

问题2 :我对第4步的实施是否正确?

QUESTION2: Is my implementation of step 4 correct?

问题3 :我是否需要规范集群内和集群间的距离?

QUESTION3: Do I need to normalise intra and inter cluster distances ?

有关步骤4的进一步说明

Further explanations on Step 4

假设我们有10个集群. 循环应该计算每对集群的DB索引.

Let's say we have 10 clusters. The loop should compute the DB index of each pair of cluster.

在第一次迭代中:

  • 求和聚类1的距离内平均值(distances_means的索引0)和聚类2的距离内平均值(distances_means的索引1)
  • 将该总和除以2个簇之间的距离(ctrds_distance的索引0)
  • sums intra-distances mean of cluster 1 (index 0 of distances_means) and intra-distances mean of cluster 2 (index 1 of distances_means)
  • divides this sum by the distance between the 2 clusters (index 0 of ctrds_distance)

在第二次迭代中:

  • 求和聚类1的距离内平均值(distances_means的索引0)和聚类3的距离内平均值(distances_means的索引2)
  • 将该总和除以2个簇之间的距离(ctrds_distance的索引1)
  • sums intra-distances mean of cluster 1 (index 0 of distances_means) and intra-distances mean of cluster 3 (index 2 of distances_means)
  • divides this sum by the distance between the 2 clusters (index 1 of ctrds_distance)

以此类推...

以10个群集为例,完整的迭代过程应如下所示:

With the example of 10 clusters, the full iteration process should look like this:

intra-cluster distance intra-cluster distance       distance between their
      of cluster:             of cluster:           centroids(storage num):
         0           +             1            /             0
         0           +             2            /             1
         0           +             3            /             2
         0           +             4            /             3
         0           +             5            /             4
         0           +             6            /             5
         0           +             7            /             6
         0           +             8            /             7
         0           +             9            /             8
         1           +             2            /             9
         1           +             3            /             10
         1           +             4            /             11
         1           +             5            /             12
         1           +             6            /             13
         1           +             7            /             14
         1           +             8            /             15
         1           +             9            /             16
         2           +             3            /             17
         2           +             4            /             18
         2           +             5            /             19
         2           +             6            /             20
         2           +             7            /             21
         2           +             8            /             22
         2           +             9            /             23
         3           +             4            /             24
         3           +             5            /             25
         3           +             6            /             26
         3           +             7            /             27
         3           +             8            /             28
         3           +             9            /             29
         4           +             5            /             30
         4           +             6            /             31
         4           +             7            /             32
         4           +             8            /             33
         4           +             9            /             34
         5           +             6            /             35
         5           +             7            /             36
         5           +             8            /             37
         5           +             9            /             38
         6           +             7            /             39
         6           +             8            /             40
         6           +             9            /             41
         7           +             8            /             42
         7           +             9            /             43
         8           +             9            /             44

这里的问题是我不太确定distances_means的索引是否匹配ctrds_distance的索引.

The problem here is I'm not quite sure that the index of distances_means matches the index of ctrds_distance.

换句话说,我不确定所计算的第一个集群间距离是否对应于集群1与集群2之间的距离.并且不确定所计算的第二个集群间距离是否对应于集群3与集群1之间的距离. ...依此类推,依此类推.

In other words, I'm not sure that the first inter-cluster distance computed corresponds to the distance between cluster 1 and cluster 2. And that the second inter-cluster distance computed corresponds to the distance between cluster 3 and cluster 1... and so on, following the pattern above.

简而言之:恐怕我正在将成对的集群内距离除以不对应的集群间距离.

In short: I'm afraid I'm dividing pairs of intra-cluster distances by an inter-cluster distance that is not corresponding.

任何帮助都受到欢迎!

推荐答案

此处是上述Davies-Bouldin索引朴素实现的更短,更快速的版本.

Here is a shorter, faster corrected version of the Davies-Bouldin index naive implementation above.

def DaviesBouldin(X, labels):
    n_cluster = len(np.bincount(labels))
    cluster_k = [X[labels == k] for k in range(n_cluster)]
    centroids = [np.mean(k, axis = 0) for k in cluster_k]
    variances = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]
    db = []

    for i in range(n_cluster):
        for j in range(n_cluster):
            if j != i:
                db.append((variances[i] + variances[j]) / euclidean(centroids[i], centroids[j]))

    return(np.max(db) / n_cluster)

回答我自己的问题:

  • 初稿(步骤4)上的计数器是正确的,但不相关
  • 无需规范簇内和簇间距离
  • 计算欧几里得距离时有一个错误

请注意,您可以找到尝试改进此索引的创新方法,尤其是"新版本Davies-Bouldin Index "(Davies-Bouldin索引"),将欧氏距离替换为圆柱距离.

Note you can find innovative approaches that try to improve this index, notably the "New Version of Davies-Bouldin Index" that replaces Euclidean distance by Cylindrical distance.

这篇关于我的Davies-Bouldin Index的python实现正确吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆