Kmeans without knowing the number of clusters?


Problem Description


I am attempting to apply k-means on a set of high-dimensional data points (about 50 dimensions) and was wondering if there are any implementations that find the optimal number of clusters.

I remember reading somewhere that the way an algorithm generally does this is such that the inter-cluster distance is maximized and intra-cluster distance is minimized but I don't remember where I saw that. It would be great if someone can point me to any resources that discuss this. I am using SciPy for k-means currently but any related library would be fine as well.
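For what it's worth, the intra-cluster half of that criterion is essentially what SciPy's kmeans already reports as the "distortion" (the mean distance of each point to its nearest centroid). A rough, commonly used heuristic is to scan several values of k and look for the "elbow" where the distortion stops dropping sharply. A minimal sketch, with a random placeholder standing in for the real 50-dimensional data:

```python
# Rough "elbow" heuristic: watch the intra-cluster distortion reported by
# scipy.cluster.vq.kmeans as k increases. Sketch only; the data below is a
# random placeholder for the real 50-D points.
import numpy as np
from scipy.cluster.vq import kmeans, whiten

data = whiten(np.random.rand(500, 50))  # whiten() rescales each feature to unit variance

for k in range(1, 11):
    _, distortion = kmeans(data, k)     # distortion = mean distance to nearest centroid
    print(f"k={k:2d}  distortion={distortion:.4f}")
```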

If there are alternate ways of achieving the same or a better algorithm, please let me know.

Solution

One approach is cross-validation.

In essence, you pick a subset of your data and cluster it into k clusters, and you ask how well it clusters, compared with the rest of the data: Are you assigning data points to the same cluster memberships, or are they falling into different clusters?

If the memberships are roughly the same, the data fit well into k clusters. Otherwise, you try a different k.
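A minimal sketch of that stability check in Python, assuming NumPy/SciPy plus scikit-learn (only for the adjusted Rand index, used here to compare two labelings): cluster two halves of the data separately, assign one half to the other half's centroids, and measure how well the two labelings agree. This is an illustration of the idea, not a definitive implementation.

```python
# Stability / cross-validation sketch: cluster two random halves independently,
# then check whether the held-out half gets the same grouping when assigned to
# the other half's centroids. Assumes NumPy, SciPy and scikit-learn.
import numpy as np
from scipy.cluster.vq import kmeans2, vq, whiten
from sklearn.metrics import adjusted_rand_score

def stability(data, k, n_repeats=5):
    """Mean agreement (adjusted Rand index) over random 50/50 splits."""
    rng = np.random.default_rng(0)
    scores = []
    for _ in range(n_repeats):
        idx = rng.permutation(len(data))
        half = len(data) // 2
        train, test = data[idx[:half]], data[idx[half:]]

        train_centroids, _ = kmeans2(train, k, minit='++')  # cluster one half
        _, test_labels = kmeans2(test, k, minit='++')       # cluster the other half independently

        # Labels the held-out half would receive from the centroids fit on the first half.
        induced_labels, _ = vq(test, train_centroids)

        scores.append(adjusted_rand_score(test_labels, induced_labels))
    return float(np.mean(scores))

data = whiten(np.random.rand(500, 50))   # placeholder for the real 50-D points
for k in range(2, 8):
    print(f"k={k}  stability={stability(data, k):.3f}")
```

Higher scores mean the same cluster structure is recovered from different samples; a value of k where the score collapses is a poor fit.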

Also, you could do PCA (principal component analysis) to reduce your 50 dimensions to some more tractable number. If a PCA run suggests that most of your variance is coming from, say, 4 out of the 50 dimensions, then you can pick k on that basis, to explore how the four cluster memberships are assigned.
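A short sketch of that PCA step, using scikit-learn's PCA here for the explained-variance ratio (numpy.linalg.svd would work just as well); the 4-component cut-off is only the example figure from the paragraph above:

```python
# Check how much variance the leading principal components explain, then
# optionally project onto them before running k-means. The data array is a
# random placeholder for the real 50-D points.
import numpy as np
from sklearn.decomposition import PCA

data = np.random.rand(500, 50)

pca = PCA().fit(data)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print("variance explained by first 10 components:", np.round(cumulative[:10], 3))

# If a handful of components (say 4) carry most of the variance, cluster there.
reduced = PCA(n_components=4).fit_transform(data)
```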
