Optimizing K-means algorithm


Problem description

I am trying to follow a paper called An Optimized Version of K-Means Algorithm. I have an idea of how the K-means algorithm works: it groups the tuples/points into clusters and updates the centroids.

I am trying to implement the method mentioned in the above paper. Their proposed algorithm is this:

My doubt is about the second step. I don't understand what is being done there. The paper says that we group our data into wider intervals based on the value of e, so that later we can avoid iterating through the entire dataset. So how do we actually store the data in that I (the intervals)? Are we supposed to define a multidimensional array? It's not making sense to me (probably I'm just not getting the idea).

The next thing I have a doubt about is step 5. There, Cw is said to be the closest centroid of that point. But how do we figure that out? At first, we would randomly assign a point as the centroid. So, are we supposed to loop through the points and find Cw (the closest centroid) before calculating e?

The next doubt is about step 6, which I guess I will be able to understand once my previous question about step 2 is answered.

And the final doubt is regarding step 7. What does CjCj' mean? The distance between the previous centroid position and the updated centroid position?

I have been brainstorming about this the whole day. Any clues would be highly appreciated. Thank you.

Recommended answer

This algorithm revolves around the idea that points which are close to one cluster center and far away from all other cluster centers will stick to that cluster center. Thus, in subsequent iterations these points don't have to be compared to all cluster centers.

Imagine a point P whose distance to its assigned cluster is 3 (i.e. that is the closest distance), while all other clusters are at least 8 away. Now compute the new cluster centers. Say the farthest any cluster center moved was 2 (that is the value of D in line 7 of the algorithm).

You can now be sure that your point still belongs to the same cluster. This is easy to understand if you consider the worst case: the assigned cluster could have moved away from the point, so at worst its new distance is 3 + 2. The closest other cluster could have moved towards the point, so its distance could now be 8 - 2. Therefore, you don't need to update that point.
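The worst-case check above can be sketched as a small predicate (a minimal illustration; the names `dist_assigned`, `dist_second`, and `drift` are my own, not from the paper):

```python
def still_assigned(dist_assigned, dist_second, drift):
    """Return True if the point is guaranteed to keep its current cluster.

    dist_assigned: distance to the currently assigned centroid
    dist_second:   distance to the second-closest centroid
    drift:         maximum distance any centroid moved this iteration (D)
    """
    # Worst case: the assigned centroid moved away by `drift` while the
    # second-closest centroid moved towards the point by `drift`.
    return dist_assigned + drift <= dist_second - drift

# With the numbers from the example above: 3 + 2 <= 8 - 2, so the point stays.
print(still_assigned(3, 8, 2))
```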


Step 2: Create intervals into which you will later put the points. In step 5 you'll create the e values. If e is 5, you'd put that point into the interval [4, 6) (if you have that interval).
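One way to store I is as a set of fixed-width buckets, where bucket i holds the indices of the points whose e falls into [i·width, (i+1)·width). This is only an illustrative guess at the data structure; the paper may choose the interval boundaries differently:

```python
def bucket_points(e_values, width=2.0):
    """Group point indices into fixed-width intervals by their e value."""
    intervals = {}
    for idx, e in enumerate(e_values):
        bucket = int(e // width)  # interval [bucket*width, (bucket+1)*width)
        intervals.setdefault(bucket, []).append(idx)
    return intervals

# e = 5.0 and e = 4.9 both land in the interval [4, 6), i.e. bucket 2.
print(bucket_points([5.0, 0.5, 4.9, 10.0], width=2.0))
```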

Step 5: Compute the distance of the point to all clusters. The closest cluster is Cw. If the next closest cluster is C3, then e = C3 - Cw. I'd call e the safety margin: it gives you the maximal distance a cluster can move before the assignment might switch.
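In code, finding Cw and e amounts to sorting the distances and taking the gap between the two smallest (a sketch with hypothetical names; `centroids` is a list of coordinate tuples):

```python
import math

def closest_and_margin(point, centroids):
    """Return (index of closest centroid Cw, safety margin e)."""
    dists = sorted((math.dist(point, c), i) for i, c in enumerate(centroids))
    (d1, w), (d2, _) = dists[0], dists[1]
    return w, d2 - d1  # e = second-closest distance minus closest distance

# Closest centroid is index 0 at distance 1; next closest is at 4, so e = 3.
w, e = closest_and_margin((0.0, 0.0), [(1.0, 0.0), (4.0, 0.0), (0.0, 6.0)])
print(w, e)
```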

Step 6: explained together with step 2.

Step 7: CjCj' is the distance that Cj moved when it was updated.

Steps 7 & 8: these are done as part of the for loop. (In step 9 they say "go back to 4".)

Step 9: Continue the loop with all the points that might have changed clusters.

This algorithm assumes that you know k-means. I believe this is a very fair assumption. It is not very explicit about when to terminate and how to initialise the algorithm; I'd consider both of these facts common knowledge.

  • Initialise the cluster centers by assigning a random point to each of them. Other initialisation routines exist; this algorithm is agnostic to that part.
  • Terminate the algorithm when there are no more intervals with a label smaller than 0. This is equivalent to the usual termination criterion in k-means: stop when no point is assigned to a different cluster.
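Putting the pieces together, here is a heavily simplified sketch of the idea. It is not the paper's exact algorithm: the interval bookkeeping is replaced by a plain per-point margin, and every margin is shrunk by 2·D after each centroid update so that the worst-case argument above still holds; only points whose margin is exhausted are recomputed:

```python
import math
import random

def optimized_kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # random-point initialisation
    assign = [None] * len(points)
    margin = [0.0] * len(points)           # safety margin e per point
    for _ in range(iters):
        # Recompute only points whose margin may have been exhausted.
        for i, p in enumerate(points):
            if assign[i] is not None and margin[i] > 0:
                continue
            d = sorted((math.dist(p, c), j) for j, c in enumerate(centroids))
            assign[i] = d[0][1]
            margin[i] = d[1][0] - d[0][0]  # e = second closest - closest
        # Update centroids and record the largest movement D.
        new_centroids, D = [], 0.0
        for j in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == j]
            if not members:
                new_centroids.append(centroids[j])
                continue
            c = tuple(sum(x) / len(members) for x in zip(*members))
            D = max(D, math.dist(c, centroids[j]))
            new_centroids.append(c)
        centroids = new_centroids
        if D == 0.0:                       # nothing moved: converged
            break
        # Each centroid moved at most D, so every margin shrinks by at most 2*D.
        margin = [e - 2 * D for e in margin]
    return centroids, assign
```

On two well-separated blobs this assigns each pair of nearby points to the same cluster while skipping distance computations for points whose margin is still positive.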

