Using scipy kmeans for cluster analysis


Question

I want to understand scipy.cluster.vq.kmeans.

Having a number of points distributed in 2D space, the problem is to group them into clusters. This problem came to my attention reading this question, and I was thinking that scipy.cluster.vq.kmeans would be the way to go.

This is the data:

Using the following code, the aim would be to get the center point of each of the 25 clusters.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import vq, kmeans, whiten

# generate 25 well separated clusters on a 5x5 grid (centers at 0, 4, 8, 12, 16)
pos = np.arange(0,20,4)
scale = 0.4
size = 50
x = np.array([np.random.normal(i,scale,size*len(pos)) for i in pos]).flatten()
y = np.array([np.array([np.random.normal(i,scale,size) for i in pos]) for j in pos]).flatten()


plt.scatter(x,y, s=16, alpha=0.4)


#perform clustering with scipy.cluster.vq.kmeans
features = np.c_[x,y]

# take raw data to cluster
clusters = kmeans(features,25)
p = clusters[0]
plt.scatter(p[:,0],p[:,1], s=81, c="crimson")

# perform whitening (normalization to std) first
whitened = whiten(features) 
clustersw = kmeans(whitened,25)
q = clustersw[0]*features.std(axis=0)
plt.scatter(q[:,0],q[:,1], s=25, c="gold")

plt.show()

The result looks like this:

The red dots mark the locations of the cluster centers without whitening, the yellow points those obtained with whitening. While they are different, the main problem is that they are obviously not all at the correct positions. Because the clusters are all well separated, I'm having trouble understanding why this simple clustering fails.

I read this question, which reports kmeans not giving accurate results, but the answer is not really satisfactory. The suggested solution of using kmeans2 with minit='points' did not work either; i.e. kmeans2(features, 25, minit='points') gives a similar result to the above.

So the question would be: is there a way to perform this easy clustering problem with scipy.cluster.vq.kmeans? And if so, how would I make sure to get the correct result?

Answer

On data like this, whitening does not make a difference: your x and y axes are already similarly scaled.
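
A quick way to see this is a minimal check (reusing the features array from the code above): whiten simply divides each column by its standard deviation, so when the columns have roughly the same spread it only rescales both axes by about the same factor.

import numpy as np
from scipy.cluster.vq import whiten

# whiten() divides each feature (column) by its standard deviation;
# x and y cover the same 5x5 grid here, so their standard deviations are
# nearly equal and whitening barely changes the geometry of the data.
whitened = whiten(features)
print(features.std(axis=0))   # the two standard deviations are almost identical
np.testing.assert_allclose(whitened, features / features.std(axis=0))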

K-means does not reliably find the global optimum; it tends to get stuck in local optima. That is why it is common to run it several times and keep only the best fit, and to use smarter initialization procedures such as k-means++.
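
A minimal sketch of both remedies, reusing the features array from the question (and assuming SciPy >= 1.2, which added the '++' initialization to kmeans2):

from scipy.cluster.vq import kmeans, kmeans2

# Option 1: restart plain kmeans many times; iter is the number of
# independent runs, and the codebook with the lowest distortion is returned.
centers, distortion = kmeans(features, 25, iter=100)
print(distortion)  # lower is better; compare against a single run

# Option 2: kmeans2 with k-means++ seeding, which spreads the initial
# centroids over the data and is far less likely to get stuck.
centers_pp, labels = kmeans2(features, 25, minit='++')

With 25 well separated clusters, enough restarts or k-means++ seeding should place the returned centers on the actual cluster centers; how many restarts are enough is a judgment call.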

