K-Means Algorithm


Question


Possible Duplicates:
How to optimal K in K - Means Algorithm
How do I determine k when using k-means clustering?

Based on statistical measures, can we decide on K? For example, the standard deviation, mean, variance, etc. Or:

Is there any simple method for choosing K in the k-means algorithm?

Thanks in advance, Navin

Answer

If you explicitly want to use k-means, you could study the article describing x-means. When using an implementation of x-means, the only difference compared to k-means is that rather than specifying a single k, you specify a range for k. The "best" choice in that range, with respect to some measure, will be part of the output of x-means. You can also look into the mean-shift clustering algorithm.

If it is computationally feasible with your given data (possibly using sampling, as yura suggests), you could run the clustering with various values of k and evaluate the quality of the resulting clusterings using some of the standard cluster validity measures.
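As a rough illustration of this "try several k and evaluate" approach, here is a minimal numpy sketch (not from the original answer) that uses the within-cluster sum of squares, i.e. the classic elbow criterion, as the quality measure:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's k-means; returns labels and the within-cluster
    sum of squares (inertia). Illustrative only, not production code."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned points
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    inertia = ((X - centers[labels]) ** 2).sum()
    return labels, inertia

# Two well-separated blobs: the inertia curve should drop sharply at k = 2,
# which is the "elbow" that suggests the number of clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
for k in (1, 2, 3, 4):
    _, inertia = kmeans(X, k)
    print(f"k={k}  inertia={inertia:.1f}")
```

In practice you would use a proper validity index (silhouette, Davies-Bouldin, gap statistic) rather than raw inertia, since inertia always decreases as k grows.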

@doug It is not correct that k-means++ determines an optimal k for the number of clusters before cluster assignment starts. k-means++ differs from k-means only in its seeding: instead of choosing all k initial centroids at random, it chooses one initial centroid at random and then successively chooses centers until k have been chosen. After the first, completely random choice, each new centroid is a data point chosen with a probability determined by a potential function that depends on the point's distance to the centers already chosen. The standard reference for k-means++ is "k-means++: The Advantages of Careful Seeding" by Arthur and Vassilvitskii.
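The seeding procedure described above can be sketched in a few lines of numpy (an illustration of the idea, not the reference implementation; the function name `kmeans_pp_init` is mine):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding: the first center is uniform over the data; each
    subsequent center is a data point drawn with probability proportional
    to D(x)^2, its squared distance to the nearest already-chosen center."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    while len(centers) < k:
        # D(x)^2 for every point: squared distance to the nearest chosen center
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

# Two far-apart blobs: after the first random pick, points in the other blob
# have a much larger D(x)^2 and are strongly favoured as the second center.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(20, 0.5, (30, 2))])
centers = kmeans_pp_init(X, 2)
print(centers)
```

Note that k is still an input here: the seeding spreads the initial centers out, but it does not choose the number of clusters.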

Also, I don't think that choosing k to be the number of principal components will, in general, improve your clustering. Imagine data points in three-dimensional space all lying in a plane through the origin. You would then get 2 principal components, but the "natural" clustering of the points could have any number of clusters.
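The plane example can be made concrete with a small numpy sketch (illustrative only): four clearly separated clusters embedded in a plane in 3-D still yield only 2 non-zero principal components.

```python
import numpy as np

rng = np.random.default_rng(0)
# Four well-separated clusters, all lying in the z = 0 plane of 3-D space.
means = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
pts = np.vstack([m + rng.normal(0, 0.2, (25, 2)) for m in means])
X = np.column_stack([pts, np.zeros(len(pts))])  # embed the plane in 3-D

# Singular values of the centered data are proportional to the standard
# deviations along the principal axes.
sing = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
print(sing)  # two large singular values; the third is effectively 0
```

So PCA reports the dimensionality of the subspace the data spans (here 2), which says nothing about the number of clusters within that subspace (here 4).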
