Calculating the percentage of variance measure for k-means?


Problem Description


On the Wikipedia page, an elbow method is described for determining the number of clusters in k-means. The built-in method of scipy provides an implementation but I am not sure I understand how the distortion as they call it, is calculated.


More precisely, if you graph the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph.


Assuming that I have the following points with their associated centroids, what is a good way of calculating this measure?

import numpy
from scipy.cluster.vq import kmeans

points = numpy.array([[ 0,  0],
       [ 0,  1],
       [ 0, -1],
       [ 1,  0],
       [-1,  0],
       [ 9,  9],
       [ 9, 10],
       [ 9,  8],
       [10,  9],
       [10,  8]])

kmeans(points, 2)
(array([[9, 8],
   [0, 0]]), 0.9414213562373096)


I am specifically looking at computing the 0.94.. measure given just the points and the centroids. I am not sure whether any of the built-in methods of scipy can be used, or whether I have to write my own. Any suggestions on how to do this efficiently for a large number of points?
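For scale, one memory-conscious option is to compute the distances in chunks, so the full N-by-K distance matrix is never held in memory at once. A minimal sketch (my own, not something built into scipy; the total_distortion helper and the chunk size are illustrative):

import numpy
from scipy.spatial.distance import cdist

def total_distortion(points, centroids, chunk=100000):
    # sum of each point's distance to its nearest centroid, computed
    # chunk by chunk so only a (chunk x K) distance matrix exists at a time
    total = 0.0
    for start in range(0, len(points), chunk):
        D = cdist(points[start:start + chunk], centroids, 'euclidean')
        total += D.min(axis=1).sum()
    return total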


In short, my questions (all related) are the following:

  • Given a distance matrix and a mapping of which point belongs to which cluster, what is a good way of computing a measure that can be used to draw the elbow plot?
  • How would the methodology change if a different distance function such as cosine similarity is used? (See the sketch after this list.)
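As a partial sketch for the second question (my own illustration, not from the original post): scipy's cdist already accepts 'cosine', which computes the cosine distance (1 minus the cosine similarity), so only the metric string changes. Reusing points and a centroids array like the one defined in the Distortion section below:

from scipy.spatial.distance import cdist

# 'cosine' gives 1 - cosine similarity; it is undefined for zero vectors,
# so this assumes all points and centroids are nonzero
D = cdist(points, centroids, 'cosine')
distortion = numpy.min(D, axis=1).sum()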

Distortion

from scipy.spatial.distance import cdist

centroids = numpy.array([[9, 8], [0, 0]])  # centroids returned by kmeans above
D = cdist(points, centroids, 'euclidean')  # distance from each point to each centroid
sum(numpy.min(D, axis=1))                  # total distance to the nearest centroids


The output for the first set of points is accurate. However, when I try a different set:

>>> pp = numpy.array([[1,2], [2,1], [2,2], [1,3], [6,7], [6,5], [7,8], [8,8]])
>>> kmeans(pp, 2)
(array([[6, 7],
       [1, 2]]), 1.1330618877807475)
>>> centroids = numpy.array([[6,7], [1,2]])
>>> D = cdist(pp, centroids, 'euclidean')
>>> sum(numpy.min(D, axis=1))
9.0644951022459797


I guess the last value does not match because kmeans seems to be dividing the value by the total number of points in the dataset.
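A quick check supports this (a sketch reusing the arrays above): dividing the summed nearest-centroid distances by the number of points reproduces the kmeans value.

# scipy's kmeans reports the mean distortion, so dividing the summed
# minimum distances by the point count should match its second return value
D = cdist(pp, centroids, 'euclidean')
print sum(numpy.min(D, axis=1)) / pp.shape[0]   # ~1.1330618877807475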

Edit 1: Percentage of variance


My code so far (should be added to Denis's K-means implementation):

centres, xtoc, dist = kmeanssample( points, 2, nsample=2,
        delta=kmdelta, maxiter=kmiter, metric=metric, verbose=0 )

print "Unique clusters: ", set(xtoc)
print ""
cluster_vars = []
for cluster in set(xtoc):
    print "Cluster: ", cluster

    # boolean mask: which points belong to the current cluster
    truthcondition = [x == cluster for x in xtoc]
    # zero out the distances of points outside the cluster
    distances_inside_cluster = (truthcondition * dist)

    # keep only the distances of this cluster's own points
    indices = [i for i, x in enumerate(truthcondition) if x]
    final_distances = [distances_inside_cluster[k] for k in indices]

    print final_distances
    print np.array(final_distances).var()
    cluster_vars.append(np.array(final_distances).var())
    print ""

print "Sum of variances: ", sum(cluster_vars)
print "Total Variance: ", points.var()
print "Percent: ", (100 * sum(cluster_vars) / points.var())


And following is the output for k=2:

Unique clusters:  set([0, 1])

Cluster:  0
[1.0, 2.0, 0.0, 1.4142135623730951, 1.0]
0.427451660041

Cluster:  1
[0.0, 1.0, 1.0, 1.0, 1.0]
0.16

Sum of variances:  0.587451660041
Total Variance:  21.1475
Percent:  2.77787757437


On my real dataset (does not look right to me!):

Sum of variances:  0.0188124746402
Total Variance:  0.00313754329764
Percent:  599.592510943
Unique clusters:  set([0, 1, 2, 3])

Sum of variances:  0.0255808508714
Total Variance:  0.00313754329764
Percent:  815.314672809
Unique clusters:  set([0, 1, 2, 3, 4])

Sum of variances:  0.0588210052519
Total Variance:  0.00313754329764
Percent:  1874.74720416
Unique clusters:  set([0, 1, 2, 3, 4, 5])

Sum of variances:  0.0672406353655
Total Variance:  0.00313754329764
Percent:  2143.09824556
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6])

Sum of variances:  0.0646291452839
Total Variance:  0.00313754329764
Percent:  2059.86465055
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7])

Sum of variances:  0.0817517362176
Total Variance:  0.00313754329764
Percent:  2605.5970695
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8])

Sum of variances:  0.0912820650486
Total Variance:  0.00313754329764
Percent:  2909.34837831
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Sum of variances:  0.102119601368
Total Variance:  0.00313754329764
Percent:  3254.76309585
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

Sum of variances:  0.125549475536
Total Variance:  0.00313754329764
Percent:  4001.52168834
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])

Sum of variances:  0.138469402779
Total Variance:  0.00313754329764
Percent:  4413.30651542
Unique clusters:  set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

Answer


The distortion, as far as Kmeans is concerned, is used as a stopping criterion: if the change between two iterations is less than some threshold, we assume convergence.
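Conceptually, the loop looks something like the sketch below (kmeans_sketch is illustrative only, not scipy's actual source):

import numpy as np
from scipy.spatial.distance import cdist

def kmeans_sketch(X, C, thresh=1e-5, maxiter=100):
    C = C.astype(float).copy()
    prev = np.inf
    for _ in range(maxiter):
        D = cdist(X, C, 'euclidean')
        idx = D.argmin(axis=1)              # assign each point to its nearest centroid
        distortion = D.min(axis=1).mean()   # the value kmeans reports
        if prev - distortion <= thresh:     # change below threshold: converged
            break
        prev = distortion
        for j in range(len(C)):             # move each centroid to its cluster's mean
            if (idx == j).any():
                C[j] = X[idx == j].mean(axis=0)
    return C, distortion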


If you want to calculate it from a set of points and the centroids, you can do the following (the code is in MATLAB, using the pdist2 function, but it should be straightforward to rewrite in Python/Numpy/Scipy):

% data (the ten points from the question)
X = [0 0 ; 0 1 ; 0 -1 ; 1 0 ; -1 0 ; 9 9 ; 9 10 ; 9 8 ; 10 9 ; 10 8];

% centroids
C = [9 8 ; 0 0];

% euclidean distance from each point to each cluster centroid
D = pdist2(X, C, 'euclidean');

% find closest centroid to each point, and the corresponding distance
[distortions,idx] = min(D,[],2);

The result:

% total distortion
>> sum(distortions)
ans =
           9.4142135623731
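A direct NumPy/SciPy translation of the MATLAB snippet (a sketch using the same data):

import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[0,0], [0,1], [0,-1], [1,0], [-1,0],
              [9,9], [9,10], [9,8], [10,9], [10,8]])
C = np.array([[9,8], [0,0]])

D = cdist(X, C, 'euclidean')      # distance from each point to each centroid
idx = np.argmin(D, axis=1)        # index of the closest centroid per point
distortions = np.min(D, axis=1)   # distance to that closest centroid

print distortions.sum()           # ~9.4142135623731, matching the MATLAB result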


EDIT #1:

I had some time to play around with this. Here is an example of KMeans clustering applied on the 'Fisher Iris Dataset' (4 features, 150 instances). We iterate over k=1..10, plot the elbow curve, pick K=3 as the number of clusters, and show a scatter plot of the result.


Note that I included a number of ways to compute the within-cluster variances (distortions), given the points and the centroids. The scipy.cluster.vq.kmeans function returns this measure by default (computed with Euclidean as a distance measure). You can also use the scipy.spatial.distance.cdist function to calculate the distances with the function of your choice (provided you obtained the cluster centroids using the same distance measure: @Denis has a solution for that), then compute the distortion from that.

import numpy as np
from scipy.cluster.vq import kmeans,vq
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt

# load the iris dataset
fName = 'C:\\Python27\\Lib\\site-packages\\scipy\\spatial\\tests\\data\\iris.txt'
fp = open(fName)
X = np.loadtxt(fp)
fp.close()

##### cluster data into K=1..10 clusters #####
K = range(1,11)

# scipy.cluster.vq.kmeans
KM = [kmeans(X,k) for k in K]
centroids = [cent for (cent,var) in KM]   # cluster centroids
#avgWithinSS = [var for (cent,var) in KM] # mean within-cluster sum of squares

# alternative: scipy.cluster.vq.vq
#Z = [vq(X,cent) for cent in centroids]
#avgWithinSS = [sum(dist)/X.shape[0] for (cIdx,dist) in Z]

# alternative: scipy.spatial.distance.cdist
D_k = [cdist(X, cent, 'euclidean') for cent in centroids]
cIdx = [np.argmin(D,axis=1) for D in D_k]
dist = [np.min(D,axis=1) for D in D_k]
avgWithinSS = [sum(d)/X.shape[0] for d in dist]

##### plot ###
kIdx = 2

# elbow curve
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(K, avgWithinSS, 'b*-')
ax.plot(K[kIdx], avgWithinSS[kIdx], marker='o', markersize=12, 
    markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Average within-cluster sum of squares')
plt.title('Elbow for KMeans clustering')

# scatter plot
fig = plt.figure()
ax = fig.add_subplot(111)
#ax.scatter(X[:,2],X[:,1], s=30, c=cIdx[k])
clr = ['b','g','r','c','m','y','k']
for i in range(K[kIdx]):
    ind = (cIdx[kIdx]==i)
    ax.scatter(X[ind,2],X[ind,1], s=30, c=clr[i], label='Cluster %d'%i)
plt.xlabel('Petal Length')
plt.ylabel('Sepal Width')
plt.title('Iris Dataset, KMeans clustering with K=%d' % K[kIdx])
plt.legend()

plt.show()


In response to the comments, I give below another complete example using the NIST hand-written digits dataset: it has 1797 images of digits from 0 to 9, each of size 8-by-8 pixels. I repeat the experiment above, slightly modified: Principal Components Analysis is applied to reduce the dimensionality from 64 down to 2.

import numpy as np
from scipy.cluster.vq import kmeans
from scipy.spatial.distance import cdist,pdist
from sklearn import datasets
from sklearn.decomposition import RandomizedPCA
from matplotlib import pyplot as plt
from matplotlib import cm

##### data #####
# load digits dataset
data = datasets.load_digits()
t = data['target']

# perform PCA dimensionality reduction
pca = RandomizedPCA(n_components=2).fit(data['data'])
X = pca.transform(data['data'])

##### cluster data into K=1..20 clusters #####
K_MAX = 20
KK = range(1,K_MAX+1)

KM = [kmeans(X,k) for k in KK]
centroids = [cent for (cent,var) in KM]
D_k = [cdist(X, cent, 'euclidean') for cent in centroids]
cIdx = [np.argmin(D,axis=1) for D in D_k]
dist = [np.min(D,axis=1) for D in D_k]

tot_withinss = [sum(d**2) for d in dist]  # Total within-cluster sum of squares
totss = sum(pdist(X)**2)/X.shape[0]       # The total sum of squares
betweenss = totss - tot_withinss          # The between-cluster sum of squares

##### plots #####
kIdx = 9        # K=10
clr = cm.spectral( np.linspace(0,1,10) ).tolist()
mrk = 'os^p<dvh8>+x.'

# elbow curve
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(KK, betweenss/totss*100, 'b*-')
ax.plot(KK[kIdx], betweenss[kIdx]/totss*100, marker='o', markersize=12, 
    markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
ax.set_ylim((0,100))
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Percentage of variance explained (%)')
plt.title('Elbow for KMeans clustering')

# show centroids for K=10 clusters
plt.figure()
for i in range(kIdx+1):
    img = pca.inverse_transform(centroids[kIdx][i]).reshape(8,8)
    ax = plt.subplot(3,4,i+1)
    ax.set_xticks([])
    ax.set_yticks([])
    plt.imshow(img, cmap=cm.gray)
    plt.title( 'Cluster %d' % i )

# compare K=10 clustering vs. actual digits (PCA projections)
fig = plt.figure()
ax = fig.add_subplot(121)
for i in range(10):
    ind = (t==i)
    ax.scatter(X[ind,0],X[ind,1], s=35, c=clr[i], marker=mrk[i], label='%d'%i)
plt.legend()
plt.title('Actual Digits')
ax = fig.add_subplot(122)
for i in range(kIdx+1):
    ind = (cIdx[kIdx]==i)
    ax.scatter(X[ind,0],X[ind,1], s=35, c=clr[i], marker=mrk[i], label='C%d'%i)
plt.legend()
plt.title('K=%d clusters'%KK[kIdx])

plt.show()


You can see how some clusters actually correspond to distinguishable digits, while others don't match a single digit.


Note: An implementation of K-means is included in scikit-learn (as well as many other clustering algorithms and various clustering metrics). Here is another similar example.
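If you go the scikit-learn route, the elbow quantity is directly available: after fitting, KMeans exposes the total within-cluster sum of squares as its inertia_ attribute. A sketch, assuming X is the data matrix from the examples above:

from sklearn.cluster import KMeans

# total within-cluster sum of squares for each k; plot against range(1, 11)
# to get the same kind of elbow curve as above
inertias = [KMeans(n_clusters=k).fit(X).inertia_ for k in range(1, 11)]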
