Calculating the percentage of variance measure for k-means?


Problem Description


On the Wikipedia page, an elbow method is described for determining the number of clusters in k-means. The built-in method of scipy provides an implementation but I am not sure I understand how the distortion as they call it, is calculated.


More precisely, if you graph the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph.


Assuming that I have the following points with their associated centroids, what is a good way of calculating this measure?

import numpy
from scipy.cluster.vq import kmeans

points = numpy.array([[ 0,  0],
       [ 0,  1],
       [ 0, -1],
       [ 1,  0],
       [-1,  0],
       [ 9,  9],
       [ 9, 10],
       [ 9,  8],
       [10,  9],
       [10,  8]])

kmeans(points, 2)
(array([[9, 8],
   [0, 0]]), 0.9414213562373096)


I am specifically looking at computing the 0.94.. measure given just the points and the centroids. I am not sure whether any of the built-in methods of scipy can be used, or whether I have to write my own. Any suggestions on how to do this efficiently for a large number of points?
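For scale, one memory-conscious option is to compute the distances in chunks, so the full N-by-K distance matrix is never held in memory at once. A minimal sketch (my own, not something built into scipy; the total_distortion helper and the chunk size are illustrative):

import numpy
from scipy.spatial.distance import cdist

def total_distortion(points, centroids, chunk=100000):
    # sum of each point's distance to its nearest centroid, computed
    # chunk by chunk so only a (chunk x K) distance matrix exists at a time
    total = 0.0
    for start in range(0, len(points), chunk):
        D = cdist(points[start:start + chunk], centroids, 'euclidean')
        total += D.min(axis=1).sum()
    return total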


In short, my questions (all related) are the following:

  • Given a distance matrix and a mapping of which point belongs to which cluster, what is a good way of computing a measure that can be used to draw the elbow plot?
  • How would the methodology change if a different distance function such as cosine similarity is used? (See the sketch after this list.)
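As a partial sketch for the second question (my own illustration, not from the original post): scipy's cdist already accepts 'cosine', which computes the cosine distance (1 minus the cosine similarity), so only the metric string changes. Reusing points and a centroids array like the one defined in the Distortion section below:

from scipy.spatial.distance import cdist

# 'cosine' gives 1 - cosine similarity; it is undefined for zero vectors,
# so this assumes all points and centroids are nonzero
D = cdist(points, centroids, 'cosine')
distortion = numpy.min(D, axis=1).sum()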

Distortion

from scipy.spatial.distance import cdist

centroids = numpy.array([[9, 8], [0, 0]])  # centroids returned by kmeans above
D = cdist(points, centroids, 'euclidean')  # distance from each point to each centroid
sum(numpy.min(D, axis=1))                  # total distance to the nearest centroids


The output for the first set of points is accurate. However, when I try a different set:

>>> pp = numpy.array([[1,2], [2,1], [2,2], [1,3], [6,7], [6,5], [7,8], [8,8]])
>>> kmeans(pp, 2)
(array([[6, 7],
       [1, 2]]), 1.1330618877807475)
>>> centroids = numpy.array([[6,7], [1,2]])
>>> D = cdist(pp, centroids, 'euclidean')
>>> sum(numpy.min(D, axis=1))
9.0644951022459797


I guess the last value does not match because kmeans seems to be dividing the value by the total number of points in the dataset.
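A quick check supports this (a sketch reusing the arrays above): dividing the summed nearest-centroid distances by the number of points reproduces the kmeans value.

# scipy's kmeans reports the mean distortion, so dividing the summed
# minimum distances by the point count should match its second return value
D = cdist(pp, centroids, 'euclidean')
print sum(numpy.min(D, axis=1)) / pp.shape[0]   # ~1.1330618877807475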

Edit 1: Percentage of variance


My code so far (should be added to Denis's K-means implementation):

centres, xtoc, dist = kmeanssample( points, 2, nsample=2,
        delta=kmdelta, maxiter=kmiter, metric=metric, verbose=0 )

print "Unique clusters: ", set(xtoc)
print ""
cluster_vars = []
for cluster in set(xtoc):
    print "Cluster: ", cluster

    # boolean mask: which points belong to the current cluster
    truthcondition = [x == cluster for x in xtoc]
    # zero out the distances of points outside the cluster
    distances_inside_cluster = (truthcondition * dist)

    # keep only the distances of this cluster's own points
    indices = [i for i, x in enumerate(truthcondition) if x]
    final_distances = [distances_inside_cluster[k] for k in indices]

    print final_distances
    print np.array(final_distances).var()
    cluster_vars.append(np.array(final_distances).var())
    print ""

print "Sum of variances: ", sum(cluster_vars)
print "Total Variance: ", points.var()
print "Percent: ", (100 * sum(cluster_vars) / points.var())


And following is the output for k=2:

Unique clusters:  set([0, 1])

Cluster:  0
[1.0, 2.0, 0.0, 1.4142135623730951, 1.0]
0.427451660041

Cluster:  1
[0.0, 1.0, 1.0, 1.0, 1.0]
0.16

Sum of variances:  0.587451660041
Total Variance:  21.1475
Percent:  2.77787757437


On my real dataset (does not look right to me!):

Sum of variances:  0.0188124746402
Total Variance:  0.00313754329764
Percent:  599.592510943
Unique clusters:  set([0, 1, 2, 3])

Sum of variances:  0.0255808508714
Total Variance:  0.00313754329764
Percent:  815.314672809
Unique clusters:  set([0, 1, 2, 3, 4])

Sum of variances:  0.0588210052519
Total Variance:  0.00313754329764
Percent:  1874.74720416
Unique clusters:  set([0, 1, 2, 3, 4, 5])

Sum of variances:  0.0672406353655
Total Variance:  0.00313754329764
Percent:  2143.09824556
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6])

Sum of variances:  0.0646291452839
Total Variance:  0.00313754329764
Percent:  2059.86465055
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7])

Sum of variances:  0.0817517362176
Total Variance:  0.00313754329764
Percent:  2605.5970695
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8])

Sum of variances:  0.0912820650486
Total Variance:  0.00313754329764
Percent:  2909.34837831
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Sum of variances:  0.102119601368
Total Variance:  0.00313754329764
Percent:  3254.76309585
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

Sum of variances:  0.125549475536
Total Variance:  0.00313754329764
Percent:  4001.52168834
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])

Sum of variances:  0.138469402779
Total Variance:  0.00313754329764
Percent:  4413.30651542
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

Answer


The distortion, as far as Kmeans is concerned, is used as a stopping criterion: if the change between two iterations is less than some threshold, we assume convergence.
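Conceptually, the loop looks something like the sketch below (kmeans_sketch is illustrative only, not scipy's actual source):

import numpy as np
from scipy.spatial.distance import cdist

def kmeans_sketch(X, C, thresh=1e-5, maxiter=100):
    C = C.astype(float).copy()
    prev = np.inf
    for _ in range(maxiter):
        D = cdist(X, C, 'euclidean')
        idx = D.argmin(axis=1)              # assign each point to its nearest centroid
        distortion = D.min(axis=1).mean()   # the value kmeans reports
        if prev - distortion <= thresh:     # change below threshold: converged
            break
        prev = distortion
        for j in range(len(C)):             # move each centroid to its cluster's mean
            if (idx == j).any():
                C[j] = X[idx == j].mean(axis=0)
    return C, distortion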


If you want to calculate it from a set of points and the centroids, you can do the following (the code is in MATLAB, using the pdist2 function, but it should be straightforward to rewrite in Python/Numpy/Scipy):

% data (the ten points from the question)
X = [0 0 ; 0 1 ; 0 -1 ; 1 0 ; -1 0 ; 9 9 ; 9 10 ; 9 8 ; 10 9 ; 10 8];

% centroids
C = [9 8 ; 0 0];

% euclidean distance from each point to each cluster centroid
D = pdist2(X, C, 'euclidean');

% find closest centroid to each point, and the corresponding distance
[distortions,idx] = min(D,[],2);

The result:

% total distortion
>> sum(distortions)
ans =
           9.4142135623731
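A direct NumPy/SciPy translation of the MATLAB snippet (a sketch using the same data):

import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[0,0], [0,1], [0,-1], [1,0], [-1,0],
              [9,9], [9,10], [9,8], [10,9], [10,8]])
C = np.array([[9,8], [0,0]])

D = cdist(X, C, 'euclidean')      # distance from each point to each centroid
idx = np.argmin(D, axis=1)        # index of the closest centroid per point
distortions = np.min(D, axis=1)   # distance to that closest centroid

print distortions.sum()           # ~9.4142135623731, matching the MATLAB result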


EDIT #1:

I had some time to play around with this. Here is an example of KMeans clustering applied on the 'Fisher Iris Dataset' (4 features, 150 instances). We iterate over k=1..10, plot the elbow curve, pick K=3 as the number of clusters, and show a scatter plot of the result.


Note that I included a number of ways to compute the within-cluster variances (distortions), given the points and the centroids. The scipy.cluster.vq.kmeans function returns this measure by default (computed with Euclidean as a distance measure). You can also use the scipy.spatial.distance.cdist function to calculate the distances with the function of your choice (provided you obtained the cluster centroids using the same distance measure: @Denis has a solution for that), then compute the distortion from that.

import numpy as np
from scipy.cluster.vq import kmeans,vq
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt

# load the iris dataset
fName = 'C:\\Python27\\Lib\\site-packages\\scipy\\spatial\\tests\\data\\iris.txt'
fp = open(fName)
X = np.loadtxt(fp)
fp.close()

##### cluster data into K=1..10 clusters #####
K = range(1,11)

# scipy.cluster.vq.kmeans
KM = [kmeans(X,k) for k in K]
centroids = [cent for (cent,var) in KM]   # cluster centroids
#avgWithinSS = [var for (cent,var) in KM] # mean within-cluster sum of squares

# alternative: scipy.cluster.vq.vq
#Z = [vq(X,cent) for cent in centroids]
#avgWithinSS = [sum(dist)/X.shape[0] for (cIdx,dist) in Z]

# alternative: scipy.spatial.distance.cdist
D_k = [cdist(X, cent, 'euclidean') for cent in centroids]
cIdx = [np.argmin(D,axis=1) for D in D_k]
dist = [np.min(D,axis=1) for D in D_k]
avgWithinSS = [sum(d)/X.shape[0] for d in dist]

##### plot ###
kIdx = 2

# elbow curve
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(K, avgWithinSS, 'b*-')
ax.plot(K[kIdx], avgWithinSS[kIdx], marker='o', markersize=12, 
    markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Average within-cluster sum of squares')
plt.title('Elbow for KMeans clustering')

# scatter plot
fig = plt.figure()
ax = fig.add_subplot(111)
#ax.scatter(X[:,2],X[:,1], s=30, c=cIdx[k])
clr = ['b','g','r','c','m','y','k']
for i in range(K[kIdx]):
    ind = (cIdx[kIdx]==i)
    ax.scatter(X[ind,2],X[ind,1], s=30, c=clr[i], label='Cluster %d'%i)
plt.xlabel('Petal Length')
plt.ylabel('Sepal Width')
plt.title('Iris Dataset, KMeans clustering with K=%d' % K[kIdx])
plt.legend()

plt.show()


In response to the comments, I give below another complete example using the NIST hand-written digits dataset: it has 1797 images of digits from 0 to 9, each of size 8-by-8 pixels. I repeat the experiment above, slightly modified: Principal Components Analysis is applied to reduce the dimensionality from 64 down to 2.

import numpy as np
from scipy.cluster.vq import kmeans
from scipy.spatial.distance import cdist,pdist
from sklearn import datasets
from sklearn.decomposition import RandomizedPCA
from matplotlib import pyplot as plt
from matplotlib import cm

##### data #####
# load digits dataset
data = datasets.load_digits()
t = data['target']

# perform PCA dimensionality reduction
pca = RandomizedPCA(n_components=2).fit(data['data'])
X = pca.transform(data['data'])

##### cluster data into K=1..20 clusters #####
K_MAX = 20
KK = range(1,K_MAX+1)

KM = [kmeans(X,k) for k in KK]
centroids = [cent for (cent,var) in KM]
D_k = [cdist(X, cent, 'euclidean') for cent in centroids]
cIdx = [np.argmin(D,axis=1) for D in D_k]
dist = [np.min(D,axis=1) for D in D_k]

tot_withinss = [sum(d**2) for d in dist]  # Total within-cluster sum of squares
totss = sum(pdist(X)**2)/X.shape[0]       # The total sum of squares
betweenss = totss - tot_withinss          # The between-cluster sum of squares

##### plots #####
kIdx = 9        # K=10
clr = cm.spectral( np.linspace(0,1,10) ).tolist()
mrk = 'os^p<dvh8>+x.'

# elbow curve
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(KK, betweenss/totss*100, 'b*-')
ax.plot(KK[kIdx], betweenss[kIdx]/totss*100, marker='o', markersize=12, 
    markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
ax.set_ylim((0,100))
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Percentage of variance explained (%)')
plt.title('Elbow for KMeans clustering')

# show centroids for K=10 clusters
plt.figure()
for i in range(kIdx+1):
    img = pca.inverse_transform(centroids[kIdx][i]).reshape(8,8)
    ax = plt.subplot(3,4,i+1)
    ax.set_xticks([])
    ax.set_yticks([])
    plt.imshow(img, cmap=cm.gray)
    plt.title( 'Cluster %d' % i )

# compare K=10 clustering vs. actual digits (PCA projections)
fig = plt.figure()
ax = fig.add_subplot(121)
for i in range(10):
    ind = (t==i)
    ax.scatter(X[ind,0],X[ind,1], s=35, c=clr[i], marker=mrk[i], label='%d'%i)
plt.legend()
plt.title('Actual Digits')
ax = fig.add_subplot(122)
for i in range(kIdx+1):
    ind = (cIdx[kIdx]==i)
    ax.scatter(X[ind,0],X[ind,1], s=35, c=clr[i], marker=mrk[i], label='C%d'%i)
plt.legend()
plt.title('K=%d clusters'%KK[kIdx])

plt.show()


You can see how some clusters actually correspond to distinguishable digits, while others don't match a single digit.


Note: An implementation of K-means is included in scikit-learn (as well as many other clustering algorithms and various clustering metrics). Here is another similar example.
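If you go the scikit-learn route, the elbow quantity is directly available: after fitting, KMeans exposes the total within-cluster sum of squares as its inertia_ attribute. A sketch, assuming X is the data matrix from the examples above:

from sklearn.cluster import KMeans

# total within-cluster sum of squares for each k; plot against range(1, 11)
# to get the same kind of elbow curve as above
inertias = [KMeans(n_clusters=k).fit(X).inertia_ for k in range(1, 11)]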
