How to set Spark KMeans initial centers


Question

I'm using Spark ML to run KMeans. I have a bunch of data and three existing centers, for example [1.0,1.0,1.0], [5.0,5.0,5.0] and [9.0,9.0,9.0]. How can I tell KMeans that the centers are these three vectors? I saw that the KMeans object has a seed parameter, but it is of type long, not an array. So how can I tell Spark KMeans to use only the existing centers for clustering?

Put another way, I don't understand what seed means in Spark KMeans. I assumed the seeds would be an array of vectors representing the centers specified before clustering runs.

Recommended answer

Indeed, seed does not mean what you think: it is not used for 'seeding' (initializing) the cluster centers, but simply sets the random seed. You can confirm this in the documentation for the Scala and Python APIs.
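To make the distinction concrete, here is a minimal plain-Python sketch (not Spark code) of what a random seed does during a "random"-style initialization. The helper name pick_random_centers and the toy data are hypothetical, introduced only for illustration; Spark's internal sampling may differ in its details.

```python
import random

def pick_random_centers(points, k, seed):
    """Pick k distinct data points as initial centers, driven only by an integer seed."""
    rng = random.Random(seed)     # the seed configures the random number generator, nothing more
    return rng.sample(points, k)  # the centers themselves are drawn from the data

# Toy data: ten 3-dimensional points [0,0,0] ... [9,9,9]
points = [[float(i)] * 3 for i in range(10)]

first = pick_random_centers(points, 3, seed=42)
again = pick_random_centers(points, 3, seed=42)

# The same seed reproduces the same random choice of centers.
print(first == again)
```

The seed is a single integer that makes the random choice reproducible. It carries no information about where the centers should be, which is why it cannot be an array of vectors.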

To the best of my knowledge, there is currently (as of Spark 2.1) no way to supply initial cluster centers for k-means in Spark ML (see this answer for Spark MLlib). According to the documentation, the initMode parameter:

can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++
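For reference, this is what "starting from existing centers" means algorithmically, as a small plain-Python sketch of the standard k-means (Lloyd) iteration. It is not Spark's API: the function names closest and kmeans_from_centers and the toy data are assumptions made for this example, and the three initial centers are the ones from the question.

```python
def closest(point, centers):
    """Index of the center nearest to point (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers]
    return dists.index(min(dists))

def kmeans_from_centers(points, initial_centers, max_iter=20):
    """Run k-means iterations starting from caller-supplied centers."""
    centers = [list(c) for c in initial_centers]
    for _ in range(max_iter):
        # Assignment step: group each point with its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            groups[closest(p, centers)].append(p)
        # Update step: move each center to the mean of its group
        # (a center with no points keeps its previous position).
        new_centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:  # stop once the centers no longer move
            break
        centers = new_centers
    return centers

# The three existing centers from the question, plus hypothetical toy data.
initial = [[1.0, 1.0, 1.0], [5.0, 5.0, 5.0], [9.0, 9.0, 9.0]]
data = [
    [0.5, 1.0, 1.5], [1.5, 1.0, 0.5],
    [5.0, 5.5, 4.5], [5.0, 4.5, 5.5],
    [9.0, 9.0, 8.0], [9.0, 9.0, 10.0],
]

final_centers = kmeans_from_centers(data, initial)
labels = [closest(p, final_centers) for p in data]
print(final_centers)
print(labels)
```

If the goal is only to assign points to the existing centers without moving them, the closest function on its own is enough and the iteration can be skipped.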
