How to set Spark KMeans initial centers
Question
I'm using Spark ML to run KMeans. I have a large amount of data and three existing centers, for example: [1.0,1.0,1.0], [5.0,5.0,5.0], [9.0,9.0,9.0].
How can I specify that the KMeans centers are these three vectors?
I saw that the KMeans object has a seed parameter, but it is of type long, not an array. So how can I tell Spark KMeans to use only the existing centers for clustering?
Put another way, I don't understand what seed means in Spark KMeans. I assumed the seeds would be an array of vectors representing the centers specified before clustering runs.
Recommended answer
Indeed, seed does not mean what you think: it is not used for 'seeding' (initializing) the cluster centers, but simply for setting the random seed. You can confirm this in the documentation for the Scala and Python APIs.
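To make the distinction concrete, here is a small sketch in plain Python. It is not Spark code and not how Spark implements initialization; pick_random_centers is a hypothetical helper and the points are made up. It only shows what a random seed does: it fixes the random number generator so that a random choice of starting centers is repeatable, while the seed itself carries no coordinates.

```python
import random


def pick_random_centers(points, k, seed):
    """Pick k distinct points as starting centers using a seeded RNG.

    Hypothetical helper for illustration only; it mimics the idea of a
    'random' initialization mode, not Spark's actual implementation.
    """
    rng = random.Random(seed)  # the seed only fixes the RNG state
    return rng.sample(points, k)


# Made-up data points for the illustration.
points = [
    [1.0, 1.0, 1.0],
    [2.0, 2.0, 2.0],
    [5.0, 5.0, 5.0],
    [6.0, 6.0, 6.0],
    [9.0, 9.0, 9.0],
]

# The same seed reproduces the same selection of starting centers.
first = pick_random_centers(points, 3, seed=42)
second = pick_random_centers(points, 3, seed=42)
```

The seed is a single number that makes the selection reproducible, which is why the parameter is a long rather than an array of vectors.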
To the best of my knowledge, there is currently (Spark 2.1) no way to supply initial cluster centers for k-means in Spark ML (see this answer for Spark MLlib). The initMode parameter, according to the documentation:
can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++