Implementing the Elbow Method for finding the optimum number of clusters for K-Means Clustering in R

Question

I want to use K-Means clustering on my dataset. I am using the kmeans() function in R to do this.

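 library(fpc)                     # provides plotcluster()
 k <- kmeans(data, centers = 3)   # k-means with 3 clusters
 plotcluster(data, k$cluster)     # plot the same data, colored by cluster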

However, I am not sure what the correct value of K is for this function. I want to try the Elbow Method for this. Are there any packages in R that use the Elbow Method to find the optimum number of clusters?

Recommended answer

There are two questions mixed up here. One is how to find a change point on a curve, and the other is how to quantify the quality of fit when using k-means to classify data. However, the cluster-analysis folks seem to lump these two questions together. Don't be afraid of looking into other curve-fit / change-point methods, using whichever fit metric seems most appropriate to your case.
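
For the fit-quality half of the problem, the quantity usually plotted for the elbow method is the total within-cluster sum of squares, which kmeans() returns as tot.withinss. The following is a minimal sketch in base R, not a definitive recipe. The built-in iris measurements stand in for the real dataset, and the range of 1 to 10 clusters, the seed, and nstart = 25 are arbitrary assumptions.

```r
# Stand-in data (assumption): four numeric columns of the built-in iris set, scaled.
data <- scale(iris[, 1:4])

set.seed(42)        # kmeans() uses random starting centers
k_values <- 1:10    # candidate numbers of clusters (arbitrary range)

# Total within-cluster sum of squares for each candidate k.
wss <- sapply(k_values, function(k) {
  kmeans(data, centers = k, nstart = 25)$tot.withinss
})

# Plot the curve; the "elbow" is where the decrease flattens out.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Reading the elbow off this plot by eye is the informal version of the change-point question. The curve itself can be fed to any change-point method.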

I know the 'elbow' method you linked to is a specific method, but you might be interested in something similar that looks for the 'knee' in the Bayesian Information Criterion (BIC). The kink in BIC versus the number of clusters (k) is the point at which you can argue that increasing BIC by adding more clusters is no longer beneficial, given the extra computational requirements of the more complex solution. There is a nice method that detects the optimum number of clusters from the change in sign of the second derivative of the BIC. See, e.g.:

Zhao, Q., V. Hautamaki, and P. Franti, 2008a: Knee point detection in BIC for detecting the number of clusters. Advanced Concepts for Intelligent Vision Systems, J. Blanc-Talon, S. Bourennane, W. Philips, D. Popescu, and P. Scheunders, Eds., Springer Berlin / Heidelberg, Lecture Notes in Computer Science, Vol. 5259, 664–673, doi:10.1007/978-3-540-88458-3_60.

Zhao, Q., M. Xu, and P. Franti, 2008b: Knee point detection on Bayesian information criterion. Tools with Artificial Intelligence, 2008. ICTAI '08. 20th IEEE International Conference on, Vol. 2, 431–438, doi:10.1109/ICTAI.2008.154.
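
The idea of locating a knee from curvature can be sketched with finite differences. The helper below, knee_by_second_difference (a hypothetical name), takes a criterion evaluated at consecutive values of k and returns the k where the discrete second derivative is largest in magnitude. This is a simplified illustration, not the exact procedure from the papers cited above, and the example values are made up.

```r
# Hypothetical helper: locate a knee as the point of largest discrete curvature.
# `k` must be consecutive integers and `score` the criterion value at each k.
knee_by_second_difference <- function(k, score) {
  stopifnot(length(k) == length(score), length(k) >= 3)
  d2 <- diff(score, differences = 2)   # score[i+2] - 2*score[i+1] + score[i]
  # d2[i] is centered on k[i + 1], so shift the index by one.
  k[which.max(abs(d2)) + 1]
}

# Made-up criterion values that drop steeply until k = 3 and then flatten.
k <- 1:8
score <- c(1000, 600, 250, 220, 200, 185, 175, 168)
knee_by_second_difference(k, score)   # returns 3
```

The same function can be applied to a BIC curve or to a within-cluster sum-of-squares curve. Which criterion is appropriate is a separate decision from how the knee is detected.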

You might be interested in an automated application of this to weather data, reported in http://journals.ametsoc.org/doi/abs/10.1175/JAMC-D-11-0227.1

See also "Finding the best trade-off point on a curve" for an excellent discussion of the general approach.
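
One general-purpose way to pick a trade-off point, sketched here under the assumption that the curve is monotone and sampled at a handful of points, is to take the point farthest from the straight line joining the first and last points. The function name elbow_by_chord_distance and the sample values are illustrative only. Both axes are rescaled to [0, 1] first so that the result does not depend on units.

```r
# Hypothetical helper: elbow as the point farthest from the end-to-end chord.
elbow_by_chord_distance <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) >= 3)
  # Rescale both axes to [0, 1] so units do not dominate the distance.
  xs <- (x - min(x)) / (max(x) - min(x))
  ys <- (y - min(y)) / (max(y) - min(y))
  n <- length(xs)
  # Direction of the chord from the first to the last point.
  dx <- xs[n] - xs[1]
  dy <- ys[n] - ys[1]
  # Perpendicular distance of every point from that chord.
  dist <- abs(dx * (ys - ys[1]) - dy * (xs - xs[1])) / sqrt(dx^2 + dy^2)
  x[which.max(dist)]
}

# Made-up curve that drops steeply until x = 3 and then flattens.
x <- 1:8
y <- c(1000, 600, 250, 220, 200, 185, 175, 168)
elbow_by_chord_distance(x, y)   # returns 3
```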

One final observation: make sure that you are consistent in your logarithms. Different communities use different notation, and this can be a source of error when comparing results.
