Trouble with scipy kmeans and kmeans2 clustering in Python
Question
I have a question about scipy's kmeans and kmeans2. I have a set of 1700 lat-long data points. I want to spatially cluster them into 100 clusters. However, I get drastically different results when using kmeans vs kmeans2. Can you explain why this is? My code is below.
First I load my data and plot the coordinates. It all looks correct.
import pandas as pd, numpy as np, matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans, kmeans2, whiten
df = pd.read_csv('data.csv')
df.head()
coordinates = df[['lon', 'lat']].to_numpy()  # as_matrix() was removed in pandas 1.0
plt.figure(figsize=(10, 6), dpi=100)
plt.scatter(coordinates[:,0], coordinates[:,1], c='c', s=100)
plt.show()
Next, I whiten the data and run kmeans() and kmeans2(). When I plot the centroids from kmeans(), it looks about right - i.e. approximately 100 points that more or less represent the locations of the full 1700 point data set.
N = len(coordinates)
w = whiten(coordinates)
k = 100
i = 20
cluster_centroids1, distortion = kmeans(w, k, iter=i)
cluster_centroids2, closest_centroids = kmeans2(w, k, iter=i)
plt.figure(figsize=(10, 6), dpi=100)
plt.scatter(cluster_centroids1[:,0], cluster_centroids1[:,1], c='r', s=100)
plt.show()
However, when I next plot the centroids from kmeans2(), it looks totally wonky to me. I would expect the results from kmeans and kmeans2 to be fairly similar, but they are completely different. While the result from kmeans does appear to represent my full data set, the result from kmeans2 looks nearly random.
plt.figure(figsize=(10, 6), dpi=100)
plt.scatter(cluster_centroids2[:,0], cluster_centroids2[:,1], c='r', s=100)
plt.show()
Here are my values for k and N, along with the size of the arrays resulting from kmeans() and kmeans2():
print('k =', k)
print('N =', N)
print(len(cluster_centroids1))
print(len(cluster_centroids2))
print(len(closest_centroids))
print(len(np.unique(closest_centroids)))
Output:
k = 100
N = 1759
96
100
1759
17
- Why would len(cluster_centroids1) not be equal to k?
- len(closest_centroids) is equal to N, which seems correct. But why would len(np.unique(closest_centroids)) not be equal to k?
- len(cluster_centroids2) is equal to k, but again, when plotted, cluster_centroids2 doesn't seem to represent the original data set the way cluster_centroids1 does.
Lastly, I plot my full coordinate data set, colored by cluster.
plt.figure(figsize=(10, 6), dpi=100)
plt.scatter(coordinates[:,0], coordinates[:,1], c=closest_centroids, s=100)
plt.show()
You can see it here:
Answer
Thank you for the good question with the sample code and images! This is a good newbie question.
Most of the peculiarities can be solved by careful reading of the docs. A few things:
- When comparing the original set of points and the resulting cluster centers, you should try and plot them in the same plot with the same dimensions (i.e., plot w against the results). For example, plot the cluster centers with large dots as you've done, and the original data with small dots on top of them.
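A minimal sketch of that overlay, using stand-in random data in place of the question's coordinates (the Agg backend and the savefig call are only there so it runs headless; use plt.show() interactively):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend; not needed for interactive use
import matplotlib.pyplot as plt
from scipy.cluster.vq import whiten, kmeans

# Stand-in data with the same shape as the question's 1759 lat-long points
rng = np.random.default_rng(0)
coordinates = rng.normal(size=(1759, 2))
w = whiten(coordinates)
cluster_centroids1, distortion = kmeans(w, 100, iter=20)

# Same axes, same (whitened) units: data as small dots, centroids as large dots
plt.figure(figsize=(10, 6), dpi=100)
plt.scatter(w[:, 0], w[:, 1], c='c', s=10)
plt.scatter(cluster_centroids1[:, 0], cluster_centroids1[:, 1], c='r', s=100)
plt.savefig('overlay.png')
```

Plotting the centroids alone, as in the question, hides whether they actually sit on top of the data.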
- kmeans and kmeans2 start from different situations. kmeans2 starts from a random distribution of points, and as your data is not evenly distributed, kmeans2 converges to a non-ideal result. You might try adding the keyword minit='points' and see if the results change.
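A sketch of that call, again with stand-in random data (the seed line is only there to make the run repeatable; kmeans2 draws its default initialization from numpy's global RNG):

```python
import numpy as np
from scipy.cluster.vq import whiten, kmeans2

rng = np.random.default_rng(0)
coordinates = rng.normal(size=(1759, 2))  # stand-in for the lat-long data
w = whiten(coordinates)

np.random.seed(1)  # repeatable initialization
# minit='points' seeds the 100 initial centroids from randomly chosen data
# points, so every centroid starts inside a populated region
cluster_centroids2, closest_centroids = kmeans2(w, 100, iter=20, minit='points')
print(len(np.unique(closest_centroids)))  # how many clusters are actually used
```

The default minit='random' samples initial centroids from a Gaussian fitted to the data, which can land many of them in empty regions of an unevenly distributed data set.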
- As the initial centroid choice is a bad one, only 17 of the initial 100 centroids actually have any points belonging to them (this is closely related to the random look of the graph).
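One way to check that (a sketch; closest_centroids here is a hypothetical label array shaped like the one in the question) is to count how many points each label owns:

```python
import numpy as np

# Hypothetical labels: 1759 points assigned to only 17 distinct centroids,
# mimicking the question's closest_centroids array
closest_centroids = np.random.default_rng(0).integers(0, 17, size=1759)

# Count how many centroids own at least one point
used, counts = np.unique(closest_centroids, return_counts=True)
print(len(used))                   # number of occupied centroids
print(counts.min(), counts.max())  # how unevenly the points are spread
```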
- It seems that some centroids in kmeans may collapse into each other if that gives the smallest distortion. (This does not seem to be documented.) Thus you will get only 96 centroids.
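This is easy to observe directly (a sketch with stand-in random data; the exact count returned can vary from run to run, but it is at most k):

```python
import numpy as np
from scipy.cluster.vq import whiten, kmeans

rng = np.random.default_rng(0)
coordinates = rng.normal(size=(1759, 2))  # stand-in for the lat-long data
w = whiten(coordinates)

cluster_centroids1, distortion = kmeans(w, 100, iter=20)
# kmeans can return fewer than k rows when centroids merge or end up empty
print(len(cluster_centroids1))
```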