Use Absolute Pearson Correlation as Distance in K-Means Algorithm (MATLAB)


Problem Description

I need to do some clustering using a correlation distance, but instead of the built-in 'distance' option 'correlation', which is defined as d = 1 - r, I need the absolute Pearson distance, so that highly anti-correlated rows get the same cluster ID. At the moment, when using the kmeans() function, I get centroids that are highly anti-correlated, which I would like to avoid by merging them. I am not yet very fluent in MATLAB and have some trouble reading the kmeans function. Would it be possible to edit it for my purpose?

Example:

When using absolute correlation distance as the metric, rows 1 and 2 should get the same cluster ID.

I made some attempts to edit the built-in MATLAB function (open kmeans -> line 775), but strangely, when I change the distance function there, I get a valid distance matrix yet wrong cluster indices, and I cannot figure out why. Hoping for some hints! All the best!
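For reference, the distance the question asks for can be sketched in Python/NumPy (the helper name `abs_pearson_distance` is ours, not a MATLAB or NumPy built-in; MATLAB's built-in 'correlation' distance corresponds to 1 - r):

```python
import numpy as np

def abs_pearson_distance(x, y):
    """Hypothetical helper: d = 1 - |r|, where r is the Pearson
    correlation between vectors x and y."""
    r = np.corrcoef(x, y)[0, 1]
    return 1.0 - abs(r)

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = -a  # perfectly anti-correlated with a

print(abs_pearson_distance(a, b))         # ~0: anti-correlated rows are "close"
print(1.0 - np.corrcoef(a, b)[0, 1])      # ~2: built-in d = 1 - r pushes them apart
```

Under d = 1 - |r|, anti-correlated rows get distance near zero, which is exactly the grouping behavior the question wants.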

Solution


This is a good example of why you should not use k-means with other distance functions.

k-means does not minimize distances. It minimizes the sum of squared 1-dimensional deviations (SSQ).

This is mathematically equivalent to squared Euclidean distance, so k-means does minimize Euclidean distances, but only as a mathematical side effect. It does not minimize arbitrary other distances, which are not equivalent to variance minimization.

In your case, it's pretty nice to see why it fails; I have to remember this as a demo case.

As you may know, k-means (Lloyd's algorithm, that is) consists of two steps: assign by minimum squared deviation, then recompute the means.
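The two steps can be sketched in Python/NumPy; this is a minimal illustration of Lloyd's algorithm, not MATLAB's kmeans, and the function name is ours:

```python
import numpy as np

def lloyd_kmeans(X, k, iters=20, seed=0):
    """Minimal sketch of Lloyd's algorithm:
    (1) assign each point to the nearest centroid by squared Euclidean
    distance, (2) recompute each centroid as the arithmetic mean."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # step 1: assignment by minimum squared deviation
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # step 2: recompute each centroid as the mean of its cluster
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# two well-separated blobs: the loop recovers them
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids = lloyd_kmeans(X, k=2)
```

Note that step 2 is hard-wired to the arithmetic mean; that coupling is exactly what breaks when you swap in a different distance in step 1.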

Now the problem is, recomputing the mean is not consistent with absolute Pearson correlation.

Let's take two of your vectors, which are -1 correlated:

+1 +2 +3 +4 +5
-1 -2 -3 -4 -5

and compute the mean:

 0  0  0  0  0

Boom. They are not at all correlated to their mean. In fact, Pearson correlation is not even well-defined for this vector anymore, because it has zero variance...
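The collapse above can be reproduced numerically (a NumPy sketch of the same two vectors):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = -x                     # Pearson correlation with x is -1

m = (x + y) / 2.0          # the mean step of k-means
print(m)                   # [0. 0. 0. 0. 0.]

# m has zero variance, so Pearson correlation with it is 0/0 = nan
# (np.errstate just silences the expected divide-by-zero warning):
with np.errstate(invalid="ignore", divide="ignore"):
    r = np.corrcoef(x, m)[0, 1]
print(r)                   # nan
```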

Why does this happen? Because you misinterpreted k-means as distance based. It's actually as much arithmetic mean based. The arithmetic mean is a least-squares (!!) estimator - it minimizes the sum of squared deviations. And that is why squared Euclidean distance works: it optimizes the same quantity as recomputing the mean. Optimizing the same objective in both steps makes the algorithm converge.
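A quick numerical check of that claim: brute-force search over candidate centers finds that the arithmetic mean is (approximately, up to grid resolution) the minimizer of the sum of squared deviations:

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

def ssq(c):
    # sum of squared deviations of v from a candidate center c
    return float(((v - c) ** 2).sum())

# brute-force search over a grid of candidate centers
candidates = np.linspace(0.0, 10.0, 1001)
best = candidates[np.argmin([ssq(c) for c in candidates])]
print(best, v.mean())      # both ~4.0: the mean is the least-squares center
```

Compare with the median (3.0 here), which minimizes the sum of *absolute* deviations instead; it gives a strictly larger SSQ on this data.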

See also this counter-example for Earth mover's distance, where the mean step of k-means yields suboptimal results (although probably not as bad as with absolute Pearson).

Instead of using k-means, consider using k-medoids aka PAM, which does work for arbitrary distances. Or one of the many other clustering algorithms, including DBSCAN and OPTICS.
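The k-medoids suggestion can be sketched in Python/NumPy. This is a naive illustration of a PAM-style update, not MATLAB's kmedoids and not an optimized PAM; the helper names and the deterministic farthest-point initialization are our assumptions:

```python
import numpy as np

def abs_pearson_dist(X):
    """Pairwise distance d = 1 - |r| between the rows of X
    (the distance the question asks for; name is ours)."""
    return 1.0 - np.abs(np.corrcoef(X))

def k_medoids(D, k, iters=100):
    """Naive PAM-style sketch: alternate between assigning points to the
    nearest medoid and re-choosing, inside each cluster, the member that
    minimizes the summed distance to the other members. Deterministic
    farthest-point initialization for reproducibility."""
    medoids = [0]
    for _ in range(k - 1):
        medoids.append(int(D[:, medoids].min(axis=1).argmax()))
    medoids = np.array(medoids)
    for _ in range(iters):
        labels = D[:, medoids].argmin(axis=1)
        new = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size:
                new[j] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return labels, medoids

X = np.array([[1, 2, 3, 4, 5],
              [-1, -2, -3, -4, -5],   # anti-correlated with row 0
              [5, 3, 4, 1, 2],
              [-5, -3, -4, -1, -2]], dtype=float)
labels, medoids = k_medoids(abs_pearson_dist(X), k=2)
print(labels)   # rows 0 and 1 end up together, as do rows 2 and 3
```

Because the medoid is always an actual data point and only the distance matrix is consulted, no mean is ever computed, so the zero-variance collapse shown above cannot happen.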

