Trouble with scipy kmeans and kmeans2 clustering in Python
Question
I have a question about scipy's kmeans and kmeans2. I have a set of 1700 lat-long data points. I want to spatially cluster them into 100 clusters. However, I get drastically different results when using kmeans vs kmeans2. Can you explain why this is? My code is below.
First I load my data and plot the coordinates. It all looks correct.
import pandas as pd, numpy as np, matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans, kmeans2, whiten
df = pd.read_csv('data.csv')
df.head()
coordinates = df[['lon', 'lat']].to_numpy()  # as_matrix() was removed in pandas 1.0
plt.figure(figsize=(10, 6), dpi=100)
plt.scatter(coordinates[:,0], coordinates[:,1], c='c', s=100)
plt.show()
Next, I whiten the data and run kmeans() and kmeans2(). When I plot the centroids from kmeans(), it looks about right - i.e. approximately 100 points that more or less represent the locations of the full 1700 point data set.
N = len(coordinates)
w = whiten(coordinates)
k = 100
i = 20
cluster_centroids1, distortion = kmeans(w, k, iter=i)
cluster_centroids2, closest_centroids = kmeans2(w, k, iter=i)
plt.figure(figsize=(10, 6), dpi=100)
plt.scatter(cluster_centroids1[:,0], cluster_centroids1[:,1], c='r', s=100)
plt.show()
However, when I next plot the centroids from kmeans2(), it looks totally wonky to me. I would expect the results from kmeans and kmeans2 to be fairly similar, but they are completely different. While the result from kmeans does appear to represent my full data set, the result from kmeans2 looks nearly random.
plt.figure(figsize=(10, 6), dpi=100)
plt.scatter(cluster_centroids2[:,0], cluster_centroids2[:,1], c='r', s=100)
plt.show()
Here are my values for k and N, along with the size of the arrays resulting from kmeans() and kmeans2():
print('k =', k)
print('N =', N)
print(len(cluster_centroids1))
print(len(cluster_centroids2))
print(len(closest_centroids))
print(len(np.unique(closest_centroids)))
Output:
k = 100
N = 1759
96
100
1759
17
- Why would len(cluster_centroids1) not be equal to k?
- len(closest_centroids) is equal to N, which seems correct. But why would len(np.unique(closest_centroids)) not be equal to k?
- len(cluster_centroids2) is equal to k, but again, when plotted, cluster_centroids2 doesn't seem to represent the original data set the way cluster_centroids1 does.
Lastly, I plot my full coordinate data set, colored by cluster.
plt.figure(figsize=(10, 6), dpi=100)
plt.scatter(coordinates[:,0], coordinates[:,1], c=closest_centroids, s=100)
plt.show()
You can see it here:
Answer
Thank you for the good question with the sample code and images! This is a good newbie question.
Most of the peculiarities can be solved by careful reading of the docs. A few things:
- When comparing the original set of points and the resulting cluster centers, you should try and plot them in the same plot with the same dimensions (i.e., plot w against the results). For example, plot the cluster centers with large dots as you've done, and the original data with small dots on top of them.
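A minimal sketch of that overlay, using stand-in random data in place of the question's coordinates (the Agg backend and the savefig call are only there so it runs headless; use plt.show() interactively):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend; not needed for interactive use
import matplotlib.pyplot as plt
from scipy.cluster.vq import whiten, kmeans

# Stand-in data with the same shape as the question's 1759 lat-long points
rng = np.random.default_rng(0)
coordinates = rng.normal(size=(1759, 2))
w = whiten(coordinates)
cluster_centroids1, distortion = kmeans(w, 100, iter=20)

# Same axes, same (whitened) units: data as small dots, centroids as large dots
plt.figure(figsize=(10, 6), dpi=100)
plt.scatter(w[:, 0], w[:, 1], c='c', s=10)
plt.scatter(cluster_centroids1[:, 0], cluster_centroids1[:, 1], c='r', s=100)
plt.savefig('overlay.png')
```

Plotting the centroids alone, as in the question, hides whether they actually sit on top of the data.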
- kmeans and kmeans2 start from different situations. kmeans2 starts from a random distribution of points, and as your data is not evenly distributed, kmeans2 converges to a non-ideal result. You might try adding the keyword minit='points' and see if the results change.
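A sketch of that call, again with stand-in random data (the seed line is only there to make the run repeatable; kmeans2 draws its default initialization from numpy's global RNG):

```python
import numpy as np
from scipy.cluster.vq import whiten, kmeans2

rng = np.random.default_rng(0)
coordinates = rng.normal(size=(1759, 2))  # stand-in for the lat-long data
w = whiten(coordinates)

np.random.seed(1)  # repeatable initialization
# minit='points' seeds the 100 initial centroids from randomly chosen data
# points, so every centroid starts inside a populated region
cluster_centroids2, closest_centroids = kmeans2(w, 100, iter=20, minit='points')
print(len(np.unique(closest_centroids)))  # how many clusters are actually used
```

The default minit='random' samples initial centroids from a Gaussian fitted to the data, which can land many of them in empty regions of an unevenly distributed data set.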
- As the initial centroid choice is a bad one, only 17 of the initial 100 centroids actually have any points belonging to them (this is closely related to the random look of the graph).
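One way to check that (a sketch; closest_centroids here is a hypothetical label array shaped like the one in the question) is to count how many points each label owns:

```python
import numpy as np

# Hypothetical labels: 1759 points assigned to only 17 distinct centroids,
# mimicking the question's closest_centroids array
closest_centroids = np.random.default_rng(0).integers(0, 17, size=1759)

# Count how many centroids own at least one point
used, counts = np.unique(closest_centroids, return_counts=True)
print(len(used))                   # number of occupied centroids
print(counts.min(), counts.max())  # how unevenly the points are spread
```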
- It seems that some centroids in kmeans may collapse into each other if that gives the smallest distortion. (This does not seem to be documented.) Thus you will get only 96 centroids.
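This is easy to observe directly (a sketch with stand-in random data; the exact count returned can vary from run to run, but it is at most k):

```python
import numpy as np
from scipy.cluster.vq import whiten, kmeans

rng = np.random.default_rng(0)
coordinates = rng.normal(size=(1759, 2))  # stand-in for the lat-long data
w = whiten(coordinates)

cluster_centroids1, distortion = kmeans(w, 100, iter=20)
# kmeans can return fewer than k rows when centroids merge or end up empty
print(len(cluster_centroids1))
```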