Spark KMeans clustering: get the number of samples assigned to a cluster


Question

I am using Spark MLlib for k-means clustering. I have a set of vectors from which I want to determine the most likely cluster center, so I will run k-means training on this set and select the cluster with the highest number of vectors assigned to it.

Therefore I need to know the number of vectors assigned to each cluster after training (i.e. KMeans.run(...)). But I cannot find a way to retrieve this information from the KMeansModel result. I could probably run predict on all the training vectors and count the label that appears most often.

Is there another way to do this?

Thanks

Answer

You are right, this info is not provided by the model, and you have to run predict. Here is an example of doing so in a parallelized way (Spark v. 1.5.1):

 from pyspark.mllib.clustering import KMeans
 from numpy import array
 data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0, 10.0, 9.0]).reshape(5, 2)
 data
 # array([[  0.,   0.],
 #       [  1.,   1.],
 #       [  9.,   8.],
 #       [  8.,   9.],
 #       [ 10.,   9.]])

 k = 2 # no. of clusters
 model = KMeans.train(
                sc.parallelize(data), k, maxIterations=10, runs=30, initializationMode="random",
                seed=50, initializationSteps=5, epsilon=1e-4)

 # assign each training point to its cluster (runs on the RDD, i.e. in parallel)
 cluster_ind = model.predict(sc.parallelize(data))
 cluster_ind.collect()
 # [1, 1, 0, 0, 0]

cluster_ind is an RDD with the same cardinality as our initial data, and it shows which cluster each datapoint belongs to. So here we have two clusters, one with 3 datapoints (cluster 0) and one with 2 datapoints (cluster 1). Notice that we have run the prediction method in a parallel fashion (i.e. on an RDD); collect() is used here only for demonstration purposes, and it is not needed in a 'real' situation.

Now we can get the cluster sizes with

 cluster_sizes = cluster_ind.countByValue().items()
 cluster_sizes
 # [(0, 3), (1, 2)]
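
As a side note, countByValue() returns its result as a dict on the driver (which is fine here, since there are only k distinct values). If you prefer to keep the aggregation itself as an RDD operation, the same counts can be computed with a plain map/reduceByKey; a minimal sketch under the same setup as above:

 sizes_rdd = cluster_ind.map(lambda c: (c, 1)).reduceByKey(lambda a, b: a + b)
 sizes_rdd.collect()
 # [(0, 3), (1, 2)]  (ordering of the pairs may differ)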

From cluster_sizes, we can get the maximum cluster index & size as

 from operator import itemgetter
 max(cluster_sizes, key=itemgetter(1))
 # (0, 3)

i.e. our biggest cluster is cluster 0, with a size of 3 datapoints, which can be easily verified by inspecting cluster_ind.collect() above.
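
And since the ultimate goal in the question is the most likely cluster center, the center of that biggest cluster can then be looked up through the model's clusterCenters list; a minimal sketch continuing the example above:

 best_cluster, best_size = max(cluster_sizes, key=itemgetter(1))
 model.clusterCenters[best_cluster]   # center of the biggest cluster
 # roughly array([ 9.  ,  8.67]) here, i.e. the mean of the 3 points in cluster 0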
