Kmeans without knowing the number of clusters?
Question
I am attempting to apply k-means on a set of high-dimensional data points (about 50 dimensions) and was wondering if there are any implementations that find the optimal number of clusters.
I remember reading somewhere that the way an algorithm generally does this is such that the inter-cluster distance is maximized and intra-cluster distance is minimized but I don't remember where I saw that. It would be great if someone can point me to any resources that discuss this. I am using SciPy for k-means currently but any related library would be fine as well.
If there are alternate ways of achieving the same or a better algorithm, please let me know.
One approach is cross-validation.
In essence, you pick a subset of your data and cluster it into k clusters, and you ask how well it clusters, compared with the rest of the data: Are you assigning data points to the same cluster memberships, or are they falling into different clusters?
If the memberships are roughly the same, the data fit well into k clusters. Otherwise, you try a different k.
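The stability check described above can be sketched with SciPy's k-means routines. This is a minimal illustration, not a standard API: the `stability` function and the agreement score are my own hypothetical construction, assuming toy data with three well-separated blobs.

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq

rng = np.random.default_rng(0)
# toy data: three well-separated blobs in 50 dimensions
X = np.concatenate([rng.normal(c, 0.5, size=(100, 50)) for c in (0.0, 5.0, 10.0)])

def stability(X, k, rng):
    """Cluster each half of the data separately, then measure whether the
    two clusterings agree on which held-out points belong together."""
    idx = rng.permutation(len(X))
    a, b = X[idx[: len(X) // 2]], X[idx[len(X) // 2 :]]
    centroids_a, _ = kmeans2(a, k, minit="++", seed=1)
    centroids_b, _ = kmeans2(b, k, minit="++", seed=1)
    # assign the same held-out half to both sets of centroids
    la, _ = vq(b, centroids_a)
    lb, _ = vq(b, centroids_b)
    # co-membership agreement: fraction of point pairs that both labelings
    # either group together or keep apart (label-permutation invariant)
    same_a = la[:, None] == la[None, :]
    same_b = lb[:, None] == lb[None, :]
    return (same_a == same_b).mean()

for k in (2, 3, 4, 5):
    print(k, round(stability(X, k, rng), 3))
```

With three genuine clusters, the agreement score should peak at k = 3; for larger k the surplus clusters split blobs arbitrarily across the two halves, and agreement drops.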
Also, you could do PCA (principal component analysis) to reduce your 50 dimensions to some more tractable number. If a PCA run suggests that most of your variance is coming from, say, 4 out of the 50 dimensions, then you can pick k on that basis, to explore how the four cluster memberships are assigned.
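Inspecting the per-component variance is a one-liner with NumPy's SVD. A small sketch, assuming synthetic 50-dimensional data whose variance is concentrated in 4 latent directions (names and data here are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy 50-dimensional data whose variance lives mostly in 4 latent directions
latent = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 50)) * 3.0
X = latent + rng.normal(size=(500, 50)) * 0.1

Xc = X - X.mean(axis=0)                   # center before PCA
_, s, _ = np.linalg.svd(Xc, full_matrices=False)
var_ratio = s**2 / (s**2).sum()           # variance fraction per component
print(np.cumsum(var_ratio)[:6].round(3))  # first few components dominate
```

If the cumulative variance saturates after a handful of components, clustering the data projected onto those components is both cheaper and less noisy than clustering in all 50 dimensions.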