聚类算法与最小尺寸的限制 [英] Algorithm for clustering with minimum size constraints

查看:328
本文介绍了聚类算法与最小尺寸的限制的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一组数据聚类成的 K 组,每个集群都有一个最小尺寸限制的 M

我已经做了一些数据进行重新建。所以,现在我得到这个组每个人都有一个或多个更好的集群是在点的,但不能单独打开,因为它会违反大小限制。

目标的:从每个点到其聚类中心最小距离的总和

符合的:最小簇尺寸M

我想找到一个算法来重新分配所有的点在不违反约束,同时保证降低目标。

我想用图表来重新点之间present成对的关系。但我不知道如何,因为它存在一个大的密集周期的可能性做了重新分配,而我失去了在多个集群之间交换多点。

我还创建集群对可能出现的交换候选人名单,但还是没能找到最优目标的方式。

我希望我解释了我的处境。我是新来的算法,和不熟悉的专业术语和规则。如果需要任何其他信息,请让我知道。

我已经做了很多的研究, 我已经尝试了本文算法,但都没有成功,因为隶属度之和并不一定与簇的大小。 <一href="http://www.researchgate.net/profile/Frank_Klawonn/publication/268292668_Clustering_with_Size_Constraints/links/54c63afc0cf219bbe4f71bf9.pdf"相对=nofollow>集群与尺寸限制

我也看到在SO其他类似的帖子,但没有找到一个详细的算法我可以实现。

我试图构建一个权有向图,顶点重presenting集群和边从A到B再presents分集A愿意移居到集B和重量是总和点

但是,随着我的数据,事实证明,所有的节​​点是在一个巨大的循环非常密集的边缘。因为我有限的经验,我仍然无法弄清楚如何这么多簇间重新分配。任何的建议是AP preciated!

这样的事情。

解决方案

要获得一个最小的(可惜不是最小)的解决方案:

  1. 首先,贪婪地重新群集的任何点,你可以在不违反 最小尺寸的限制。
  2. 接下来,建立一个有向重图如下:
    1. 在每个集群成为一个节点。
    2. 的边缘(A,B)加入的对于每一个点的所述的更靠近B的中心(以便有相同的一对节点之间潜在的多条边);其重量应正比于由移动它获得的益处。
  3. 寻找在该图中周期将让你找到移动 (其中一招是由移动在周期的每个顶点的)。
  4. 选择周期的最高总重量,重新群集对应的边缘节点。
  5. 在重复步骤1-4,直到有没有更多的循环。

创建图形将具有复杂性为O(千牛),在这里有k和n的总积分,并可以创建multiedges相同数量。的Tarjan的算法将有复杂性为O(K 2 ),假设你跳过multiedges在DFS同一目的地。可以消除一个循环时,都降低全局距离一定量,并从图中删除的至少一个边缘,所以该算法的总时间不能超过O(氏SUP> 4 米 2 )。这是相当奢侈的;我敢肯定,有可能是启发式(也许更详细的分析)来提高下界。

I have a set of data clustering into k groups, each cluster has a minimum size constraint of m

I've done some reclustering of the data. So now I got this set of points that each one has one or more better clusters to be in, but cannot be switched individually because it'll violate the size constraint.

Goal: minimize the sum of distance from each point to its cluster center.

Subject to: Minimum cluster size m

I want to find an algorithm to reassign all points without violating the constraint, while guaranteed to decrease the objective.

I thought of using Graph to represent pair wise relationship between points. But I'm not sure how to do the reassignment since it exists the possibility of a big dense cycle, and I got lost in swapping multiple points between multiple clusters.

I also created a list of cluster pairs with possible swapping candidates, but still couldn't find the way to optimal the objective.

I hope I explained my situation. I'm new to algorithm, and not familiar with the jargon and rules. If any other information needed, please let me know.

I've done a lot of research, I've tried out the algorithm in this paper, but without success, since the sum of membership degree doesn't necessarily correlate to cluster size. Clustering with Size Constraints

I've also read other similar posts on SO, but didn't find a detail algorithm I could implement.

I've tried to construct a weighted directed graph, with vertex representing clusters and edges from A to B represents points in cluster A willing to relocate to cluster B. and weight to be the sum of points

But with my data, it turns out all the nodes to be in a huge cycle with very dense edges. Because of my limited experience, I still could not figure out how to reassign among so many clusters. Any suggestions are appreciated!

Something like this.

解决方案

To get a minimal (unfortunately not minimum) solution:

  1. First, greedily recluster any points that you can without violating the minimum size constraint.
  2. Next, build a directed multigraph as follows:

    1. Every cluster becomes a node.
    2. An edge (A,B) is added for every point in A that is closer to the center of B (so that there are potentially multiple edges between the same pair of nodes); its weight should be proportional to the benefit gained by moving it.

  3. Looking for cycles in this graph will allow you to find moves (where a move consists of moving every vertex in the cycle).
  4. Pick the cycle with the highest total weight and recluster the nodes corresponding to the edges.
  5. Repeat steps 1-4 until there are no more cycles.

Creating the graph will have a complexity of O(kn), where you have k and n total points, and can create the same number of multiedges. Tarjan's algorithm will have complexity of O(k2), assuming that you skip multiedges to the same destination in the DFS. Every time you eliminate a cycle, you decrease the global distance by some amount and remove at least one edge from the graph, so the total time of the algorithm cannot exceed O(k4m2). That is quite extravagant; I'm sure there could be heuristics (and probably more detailed analysis) to improve the lower bound.

这篇关于聚类算法与最小尺寸的限制的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆