Hierarchical clusterization heuristics


Question

I want to explore relations between data items in a large array. Every data item is represented by a multidimensional vector. As a first step, I have decided to use clustering. I am interested in finding hierarchical relations between clusters (groups of data vectors). I am able to calculate the distance between my vectors, so in the first step I find the minimum spanning tree. After that I need to group the data vectors according to the links in my spanning tree. This is the step I am unsure about: how should I combine different vectors into hierarchical clusters? I am using a heuristic: if two vectors are linked and the distance between them is very small, they are in the same cluster; if two vectors are linked but the distance between them is larger than a threshold, they are in different clusters with a common root cluster.

But maybe there is a better solution?

Thanks

P.S. Thank you all!

In fact, I have tried k-means and some variations of CLOPE, but did not get good results.

So now I know that the clusters in my data set actually have a complex structure (much more complex than regular spheres).

That is why I want to use hierarchical clustering. I also guess that the clusters look like n-dimensional concatenations (like 3D or 2D chains), so I use a single-link strategy. But I am unsure how to combine different clusters with each other: in which situations should I create a common root cluster, and in which situations should I combine all sub-clusters into one cluster? I am using this simple strategy:

      
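  • If clusters (or vectors) are too close to each other, I combine their contents into one cluster (regulated by a threshold)
  • If clusters (or vectors) are too far from each other, I create a root cluster and put them into it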

But using this strategy I get very large cluster trees. I am trying to find a satisfactory threshold, but maybe there is a better strategy for generating the cluster tree?
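For readers who want something concrete, the strategy described above could be sketched roughly as follows. This is only an illustration under assumptions not stated in the question: Euclidean distance, points given as tuples, and a hypothetical function name `build_cluster_tree` with clusters stored as plain dictionaries. It walks the minimum spanning tree edges in ascending order (Kruskal's order), flattens clusters joined by a short edge, and creates a common root for clusters joined by a long edge.

```python
import math
from itertools import combinations


def build_cluster_tree(points, threshold):
    """Sketch of the threshold heuristic; names and layout are illustrative."""
    parent = list(range(len(points)))
    # One cluster record per current union-find root.
    cluster = {i: {"members": [i], "children": []} for i in range(len(points))}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All pairwise distances, shortest first (assumes Euclidean distance).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # would close a cycle, so it is not an MST edge
        a, b = cluster.pop(ri), cluster.pop(rj)
        parent[ri] = rj
        members = sorted(a["members"] + b["members"])
        if d <= threshold:
            # Short link: merge both sides into one flat cluster.
            # Edges arrive in ascending order, so neither side has children yet.
            cluster[rj] = {"members": members, "children": []}
        else:
            # Long link: keep both sides and put them under a common root.
            cluster[rj] = {"members": members, "children": [a, b]}
    (root,) = cluster.values()
    return root
```

For example, `build_cluster_tree([(0, 0), (0, 1), (10, 0), (10, 1)], threshold=2)` yields a root with two child clusters. Note that every link above the threshold adds one more level, which is one reason the resulting trees can become very deep when the threshold is small.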

Here is a simple picture that describes my question:

Recommended answer

There is a whole zoo of clustering algorithms. Among them, minimum spanning tree clustering, a.k.a. single-linkage clustering, has some nice theoretical properties, as noted e.g. at http://www.cs.uwaterloo.ca/~mackerma/Taxonomy.pdf. In particular, if you take a minimum spanning tree and remove all links longer than some threshold length, then the resulting grouping of points into clusters should have the minimum total length of remaining links of any grouping of that size, for the same reason that Kruskal's algorithm produces a minimum spanning tree.
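A minimal sketch of that threshold cut, assuming Euclidean distance and points given as tuples (the function name `single_linkage_clusters` is made up for this illustration). Rather than building the tree and then deleting long links, it runs Kruskal's algorithm and stops at the first edge longer than the threshold, which leaves the same connected components.

```python
import math
from itertools import combinations


def single_linkage_clusters(points, threshold):
    """Group point indices by dropping every link longer than `threshold`."""
    parent = list(range(len(points)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All pairwise distances, shortest first (assumes Euclidean distance).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    for d, i, j in edges:
        if d > threshold:
            break  # every remaining edge is at least this long
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj  # accept the edge, as Kruskal's algorithm would

    # Collect the indices that share a union-find root.
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Calling it with several thresholds on the same data gives nested groupings, since raising the threshold can only merge clusters and never split them. The all-pairs edge list is quadratic in the number of points, so for a large array a library implementation would likely be a better fit than this sketch.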

However, there is no guarantee that a minimum spanning tree will be the best choice for your particular purpose, so I think you should either write down what you actually need from your clustering algorithm and then choose a method based on that, or try a variety of different clustering algorithms on your data and see which works best in practice.
