Trouble with scipy kmeans and kmeans2 clustering in Python

Question

I have a question about scipy's kmeans and kmeans2. I have a set of 1700 lat-long data points. I want to spatially cluster them into 100 clusters. However, I get drastically different results when using kmeans vs kmeans2. Can you explain why this is? My code is below.

First I load my data and plot the coordinates. It all looks correct.

import pandas as pd, numpy as np, matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans, kmeans2, whiten

df = pd.read_csv('data.csv')
df.head()

coordinates = df[['lon', 'lat']].to_numpy()  # lon/lat columns as a NumPy array
plt.figure(figsize=(10, 6), dpi=100)
plt.scatter(coordinates[:,0], coordinates[:,1], c='c', s=100)
plt.show()

Next, I whiten the data and run kmeans() and kmeans2(). When I plot the centroids from kmeans(), it looks about right - i.e. approximately 100 points that more or less represent the locations of the full 1700 point data set.

N = len(coordinates)
w = whiten(coordinates)
k = 100
i = 20

cluster_centroids1, distortion = kmeans(w, k, iter=i)
cluster_centroids2, closest_centroids = kmeans2(w, k, iter=i)

plt.figure(figsize=(10, 6), dpi=100)
plt.scatter(cluster_centroids1[:,0], cluster_centroids1[:,1], c='r', s=100)
plt.show()

However, when I next plot the centroids from kmeans2(), it looks totally wonky to me. I would expect the results from kmeans and kmeans2 to be fairly similar, but they are completely different. While the result from kmeans does appear to be a simplified yet representative version of my full data set, the result from kmeans2 looks nearly random.

plt.figure(figsize=(10, 6), dpi=100)
plt.scatter(cluster_centroids2[:,0], cluster_centroids2[:,1], c='r', s=100)
plt.show()

Here are my values for k and N, along with the size of the arrays resulting from kmeans() and kmeans2():

print('k =', k)
print('N =', N)
print(len(cluster_centroids1))
print(len(cluster_centroids2))
print(len(closest_centroids))
print(len(np.unique(closest_centroids)))

Output:

k = 100
N = 1759
96
100
1759
17

  • Why would len(cluster_centroids1) not be equal to k?
  • len(closest_centroids) is equal to N, which seems correct. But why would len(np.unique(closest_centroids)) not be equal to k?
  • len(cluster_centroids2) is equal to k, but again, when plotted, cluster_centroids2 doesn't seem to represent the original data set the way cluster_centroids1 does.
Lastly, I plot my full coordinate data set, colored by cluster.

plt.figure(figsize=(10, 6), dpi=100)
plt.scatter(coordinates[:,0], coordinates[:,1], c=closest_centroids, s=100)
plt.show()

You can see it here:

Answer

Thank you for the good question with the sample code and images! This is a good newbie question.

Most of the peculiarities can be solved by careful reading of the docs. A few things:

  • When comparing the original set of points and the resulting cluster centers, you should try to plot them in the same plot with the same dimensions (i.e., the whitened data w against the results). For example, plot the cluster centers with large dots, as you've done, and the original data with small dots on top of them; see the sketch after this list.

  • kmeans and kmeans2 start from different situations. kmeans2 starts from a random distribution of points, and as your data is not evenly distributed, kmeans2 converges to a non-ideal result. You might try adding the keyword minit='points' and seeing if the results change; the sketch after this list does exactly that.

  • As the initial centroid choice is a bad one, only 17 of the initial 100 centroids actually have any points belonging to them (this is closely related to the random look of the graph).

  • It seems that some centroids in kmeans may collapse into each other if that gives the smallest distortion. (This does not seem to be documented.) Thus you will get only 96 centroids.
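
Here is a minimal sketch of the two suggestions above, assuming the w, k and i variables from the question are still in scope (the names centroids_pts and labels_pts are made up for illustration): re-run kmeans2() with minit='points' and draw the resulting cluster centers on top of the whitened data, so that both appear in the same coordinate system.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans2

# Initialise the centroids from randomly chosen observations instead of a
# random distribution; w, k and i come from the question above.
centroids_pts, labels_pts = kmeans2(w, k, iter=i, minit='points')

plt.figure(figsize=(10, 6), dpi=100)
plt.scatter(w[:,0], w[:,1], c='c', s=10)                            # whitened data, small dots
plt.scatter(centroids_pts[:,0], centroids_pts[:,1], c='r', s=100)   # cluster centers, large dots
plt.show()

# With a better initialisation, far more than 17 of the k centroids should
# end up with points assigned to them.
print(len(np.unique(labels_pts)))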
