Why not use just Canopy clustering instead of combining with KMeans (Mahout)


Question

The question is in the title: if Canopy can be used for clustering, as well as for determining centroids, why not use it for clustering on its own, instead of using it only to generate centroids as input for KMeans clustering?

I'm considering an implementation using Mahout, but I think this is more of a conceptual question, not tied to any particular system.

Thanks

Answer

Canopy is deprecated in Mahout, so I wouldn't use it at all.

It is fast, so the idea was to make a quick, better-than-random estimate of the starting centroids so that KMeans would converge faster.
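For illustration, here is a minimal sketch of how Canopy picks seed centroids in a single pass (not Mahout's implementation; the full algorithm also uses a loose threshold T1 to define overlapping canopy membership, which is omitted here because only the centers are needed as KMeans seeds):

```python
import numpy as np

def canopy_centers(points, t2):
    """Single pass: repeatedly take a remaining point as a new canopy
    center, then drop every point within the tight threshold t2 of it.
    No iteration, no convergence check - one sweep and done."""
    remaining = list(range(len(points)))
    centers = []
    while remaining:
        c = remaining[0]
        centers.append(points[c])
        # Discard all points within t2 of the new center (including itself).
        remaining = [i for i in remaining
                     if np.linalg.norm(points[i] - points[c]) > t2]
    return np.array(centers)
```

Because the pass is linear in the number of points, it is cheap enough to run before KMeans, which is the whole appeal.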

Canopy has no convergence criterion, so its first guess is all you get. KMeans iterates, following a gradient-descent-style algorithm, to find a local minimum of the defined error function. It therefore converges towards better guesses, but you generally start from random centroids and hope they are placed well. Canopy was an attempt to place the starting centroids better, but in practice it worked little, if at all, better than random initialization.
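The KMeans iteration described above can be sketched in plain NumPy (a toy version of Lloyd's algorithm, not Mahout's code; the starting centroids would come from Canopy or from random choice):

```python
import numpy as np

def kmeans(points, centroids, iters=20):
    """Lloyd's iteration: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points.
    Each round can only lower the within-cluster squared error."""
    for _ in range(iters):
        # Distance from every point to every centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster emptied out.
        centroids = np.array([
            points[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(len(centroids))
        ])
    return centroids, labels
```

The repeated assign-and-update loop is exactly what a Canopy-only result lacks.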

So you could just take Canopy's guess and compute clusters by going through all vectors and finding which canopy centroid each is closest to, but those clusters would not get the benefit of iteration and would score worse in cross-validation tests.
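That single-pass assignment, treating Canopy's centroids as final, amounts to one nearest-centroid labelling step (a hypothetical sketch; `centroids` stands in for the output of a Canopy run):

```python
import numpy as np

def assign_once(points, centroids):
    """One pass, no iteration: label each vector with the index of its
    nearest centroid. This is all a Canopy-only clustering gives you -
    the centroids are never moved to better fit the data."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)
```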
