What makes the distance measure in k-medoid "better" than k-means?

Problem description

I am reading about the difference between k-means clustering and k-medoid clustering.

Supposedly there is an advantage to using the pairwise distance measure in the k-medoid algorithm, instead of the more familiar sum of squared Euclidean distances used to evaluate variance in k-means. And apparently this different distance measure somehow reduces the influence of noise and outliers.

I have seen this claim, but I have yet to see any good reasoning about the mathematics behind it.

What makes the pairwise distance measure commonly used in k-medoid better? More exactly, how does the lack of a squared term allow k-medoids to have the desirable properties associated with the concept of taking a median?

Answer

1. K-medoid is more flexible

First of all, you can use k-medoids with any similarity measure. K-means, however, may fail to converge; it really should only be used with distances that are consistent with the mean. So, for example, absolute Pearson correlation must not be used with k-means, but it works well with k-medoids.
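
To make the flexibility concrete, here is a minimal sketch (my own illustration, not code from the answer) of a PAM-style k-medoids that only ever reads a precomputed pairwise dissimilarity matrix; the absolute-Pearson dissimilarity and the helper names (abs_pearson_dist, k_medoids) are assumptions made for this example:

    import numpy as np

    def abs_pearson_dist(X):
        # Dissimilarity 1 - |corr(x_i, x_j)| between the rows of X.
        return 1.0 - np.abs(np.corrcoef(X))

    def k_medoids(D, k, max_iter=100, seed=0):
        # Naive PAM-style k-medoids that only ever reads the matrix D.
        rng = np.random.default_rng(seed)
        medoids = rng.choice(D.shape[0], size=k, replace=False)
        for _ in range(max_iter):
            labels = np.argmin(D[:, medoids], axis=1)  # nearest-medoid assignment
            new_medoids = medoids.copy()
            for j in range(k):
                members = np.flatnonzero(labels == j)
                if members.size == 0:
                    continue
                # The medoid minimizes the total distance to its cluster members.
                costs = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[np.argmin(costs)]
            if np.array_equal(new_medoids, medoids):
                break
            medoids = new_medoids
        return medoids, np.argmin(D[:, medoids], axis=1)

    # Toy data: rows 0-4 share one signal, rows 5-9 are independent noise.
    rng = np.random.default_rng(1)
    signal = rng.normal(size=50)
    X = np.vstack([signal + 0.1 * rng.normal(size=50) for _ in range(5)] +
                  [rng.normal(size=50) for _ in range(5)])
    medoids, labels = k_medoids(abs_pearson_dist(X), k=2)
    print(labels)  # rows 0-4 should come out in one cluster

Because the algorithm only consumes the matrix D, any dissimilarity can be plugged in; there is no mean to compute, so nothing needs to be consistent with a mean.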

2. K-medoid is more robust

Secondly, the medoid as used by k-medoids is roughly comparable to the median (in fact, there is also k-medians, which is like k-means but with the Manhattan distance). If you look up the literature on the median, you will find plenty of explanations and examples of why the median is more robust to outliers than the arithmetic mean. Essentially, those explanations and examples hold for the medoid as well. It is a more robust estimate of a representative point than the mean as used in k-means.

Consider the following 1-dimensional example:

[1, 2, 3, 4, 100000]

Both the median and medoid of this set are 3. The mean is 20002.

Which do you think is more representative of the data set? The mean has the lower squared error, but assuming that there might be a measurement error in this data set ...
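
A quick sketch to check these numbers (the medoid computation below is my own helper, not part of the answer):

    import numpy as np

    x = np.array([1, 2, 3, 4, 100000], dtype=float)

    # The medoid is the data point minimizing the summed distance to all others.
    pairwise = np.abs(x[:, None] - x[None, :])
    medoid = x[pairwise.sum(axis=1).argmin()]

    print(np.mean(x))    # 20002.0 -- dragged away by the outlier
    print(np.median(x))  # 3.0
    print(medoid)        # 3.0 -- coincides with the median in 1-D here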

Technically, the notion of breakdown point is used in statistics. The median has a breakdown point of 50% (i.e. half of the data points can be incorrect, and the result is still unaffected), whereas the mean has a breakdown point of 0 (i.e. a single large observation can yield a bad estimate).

I do not have a proof, but I assume the medoid will have a similar breakdown point as the median.
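
As a small numeric illustration of the breakdown-point idea (my own sketch, corrupting progressively more points of a clean sample):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=0.0, scale=1.0, size=101)  # clean sample centered at 0

    for n_bad in (1, 10, 50):
        corrupted = x.copy()
        corrupted[:n_bad] = 1e6                   # replace n_bad points with garbage
        print(n_bad, np.mean(corrupted), np.median(corrupted))

A single corrupted point already moves the mean by roughly 1e6/101, about 9900, while the median stays within the range of the clean data even with 50 of the 101 points replaced.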

3. K-medoid is more expensive

That's the main drawback. Usually, PAM takes much longer to run than k-means. As it involves computing all pairwise distances, it is O(n^2*k*i); whereas k-means runs in O(n*k*i), where usually k*i << n.
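
A back-of-the-envelope count of distance evaluations makes the gap tangible (my own arithmetic, plugging assumed values into the bounds above):

    # Assumed problem size: n points, k clusters, i iterations.
    n, k, i = 10_000, 10, 20

    kmeans_ops = n * k * i   # each iteration compares every point to every centroid
    pam_ops = n * n * k * i  # the naive PAM bound quoted above

    print(f"k-means: ~{kmeans_ops:.1e} distance evaluations")  # ~2.0e+06
    print(f"PAM:     ~{pam_ops:.1e} distance evaluations")     # ~2.0e+10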
