How to pick the T1 and T2 threshold values for Canopy Clustering?


Question

I am trying to implement the Canopy clustering algorithm along with K-means. From what I have read online, Canopy clustering can be used to get the initial starting points to feed into K-means. The problem is that Canopy clustering needs two threshold values for each canopy, T1 and T2: points within the inner threshold are strongly tied to that canopy, and points within the wider threshold are less strongly tied to it. How are these thresholds, or distances from the canopy center, determined?
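For reference, the role of the two thresholds can be sketched for one-dimensional data as below. This is a minimal illustration, not a definitive implementation: it assumes the common convention that T1 is the wider (loose) radius and T2 the inner (tight) radius, so T1 > T2, it uses absolute difference as the distance, and it always takes the first remaining point as the next canopy center. The function name `canopies_1d` is made up for this example.

```python
def canopies_1d(points, t1, t2):
    """Group 1-D points into canopies; returns a list of (center, members).

    Assumption: t1 is the loose (wider) radius and t2 the tight (inner)
    radius, so t1 > t2 > 0.
    """
    if not t1 > t2 > 0:
        raise ValueError("expected t1 > t2 > 0")
    remaining = list(points)
    canopies = []
    while remaining:
        # Assumption: take the first remaining point as the center
        # (a random pick is another common choice).
        center = remaining[0]
        # Every remaining point within the loose radius joins this canopy.
        members = [p for p in remaining if abs(p - center) <= t1]
        canopies.append((center, members))
        # Points within the tight radius are dropped from the candidate
        # list; points between t2 and t1 stay and may join later canopies.
        remaining = [p for p in remaining if abs(p - center) > t2]
    return canopies


if __name__ == "__main__":
    for center, members in canopies_1d([8, 17.5, 17.5, 23, 66], t1=10, t2=5):
        print(center, members)
```

With the sample numbers from this question and the arbitrary choice T1 = 10, T2 = 5, this produces four canopies centered at 8, 17.5, 23 and 66. The centers, and their count, are what would then be handed to K-means as starting points.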

Context:

The problem I'm trying to solve is this: I have a set of numbers in a range such as [1, 30] or [1, 250], with set sizes of about 50. There can be duplicate elements, and they can be floating-point numbers as well, such as 8, 17.5, 17.5, 23, 66, ... I want to find the optimal clusters, or subsets, of the set of numbers.

So, if Canopy clustering with K-means is a good choice, my question still stands: how do you find the T1 and T2 values? If it is not a good choice, is there a better, simpler but still effective algorithm to use?

Recommended answer

Actually, that is the big issue with Canopy clustering. Choosing the thresholds is pretty much as difficult as the actual algorithm, in particular in high dimensions. For a 2D geographic data set, a domain expert can probably define the distance thresholds easily. But for high-dimensional data, probably the best you can do is to run k-means on a sample of your data first, then choose the distances based on this sample run.
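One way the "run k-means on a sample, then choose the distances" idea might look for one-dimensional numbers is sketched below. The answer does not say how to turn the sample run into thresholds, so the rules used here are assumptions made purely for illustration: T2 is set to the average distance from a sampled point to its nearest centroid, and T1 to the larger of twice T2 and half the smallest gap between adjacent centroids. The names `kmeans_1d` and `suggest_thresholds`, the sample size, and the seed are all hypothetical, and `k` must be at least 2 and no larger than the number of distinct sampled values.

```python
import random


def kmeans_1d(values, k, iters=50, seed=0):
    """Plain Lloyd's k-means on 1-D values; returns sorted centroids."""
    rng = random.Random(seed)
    # Start from k distinct values picked at random from the data.
    centroids = rng.sample(sorted(set(values)), k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)


def suggest_thresholds(values, k, sample_size=30, seed=0):
    """Return (t1, t2) guessed from a k-means run on a random sample.

    Both rules below are illustrative assumptions, not an established
    recipe.
    """
    rng = random.Random(seed)
    sample = rng.sample(list(values), min(sample_size, len(values)))
    centroids = kmeans_1d(sample, k, seed=seed)
    # Assumed rule for T2 (tight): mean distance to the nearest centroid.
    t2 = sum(min(abs(v - c) for c in centroids) for v in sample) / len(sample)
    # Assumed rule for T1 (loose): at least twice T2, and at least half
    # the smallest gap between neighbouring centroids.
    gaps = [b - a for a, b in zip(centroids, centroids[1:])]
    t1 = max(2 * t2, min(gaps) / 2)
    return t1, t2


if __name__ == "__main__":
    data = [1, 2, 3, 50, 51, 52, 100, 101, 102]
    print(suggest_thresholds(data, k=3, sample_size=9))
```

Because k-means depends on its random starting points, different seeds can give noticeably different suggestions. It is probably worth trying several runs and checking the resulting thresholds against what makes sense for the data.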
