Spark KMeans clustering: get the number of samples assigned to a cluster


Question

I am using Spark MLlib for k-means clustering. I have a set of vectors from which I want to determine the most likely cluster center, so I will run k-means clustering training on this set and select the cluster with the highest number of vectors assigned to it.

Therefore I need to know the number of vectors assigned to each cluster after training (i.e. after KMeans.run(...)). But I cannot find a way to retrieve this information from the KMeansModel result. I would probably need to run predict on all training vectors and count the label that appears most often.

Is there another way to do this?

Thanks

Answer

You are right, this information is not provided by the model, and you have to run predict. Here is an example of doing so in a parallelized way (Spark v. 1.5.1):

 from pyspark.mllib.clustering import KMeans
 from numpy import array

 # 5 two-dimensional data points
 data = array([0.0, 0.0, 1.0, 1.0, 9.0, 8.0, 8.0, 9.0, 10.0, 9.0]).reshape(5, 2)
 data
 # array([[  0.,   0.],
 #        [  1.,   1.],
 #        [  9.,   8.],
 #        [  8.,   9.],
 #        [ 10.,   9.]])

 # 'sc' is the SparkContext, available by default in the pyspark shell
 k = 2  # no. of clusters
 model = KMeans.train(
                sc.parallelize(data), k, maxIterations=10, runs=30, initializationMode="random",
                seed=50, initializationSteps=5, epsilon=1e-4)

 # predict on an RDD: returns an RDD with the cluster index of each data point
 cluster_ind = model.predict(sc.parallelize(data))
 cluster_ind.collect()
 # [1, 1, 0, 0, 0]

cluster_ind is an RDD with the same cardinality as our initial data, and it shows which cluster each data point belongs to. So, here we have two clusters, one with 3 data points (cluster 0) and one with 2 data points (cluster 1). Notice that we have run the prediction method in a parallel fashion (i.e. on an RDD) - collect() is used here only for demonstration purposes, and it is not needed in a 'real' situation.

Now, we can get the cluster sizes with

 cluster_sizes = cluster_ind.countByValue().items()
 cluster_sizes
 # [(0, 3), (1, 2)]
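
As a side note, countByValue() is an action, so the counts come back to the driver as a Python dict; that is fine here because there are only k distinct values. If you would rather keep the counts distributed as an RDD, a minimal equivalent sketch using plain map/reduceByKey is

 # same counts, but kept as an RDD; collect() only at the end for display
 cluster_sizes_rdd = cluster_ind.map(lambda c: (c, 1)).reduceByKey(lambda a, b: a + b)
 cluster_sizes_rdd.collect()
 # e.g. [(0, 3), (1, 2)]  (the ordering of the pairs may differ)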

From cluster_sizes, we can then get the maximum cluster index & size as

 from operator import itemgetter
 max(cluster_sizes, key=itemgetter(1))
 # (0, 3)

I.e. our biggest cluster is cluster 0, with a size of 3 data points, which can be easily verified by inspecting the cluster_ind.collect() output above.
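
Since the original goal was to find the most likely cluster center, the winning index can then be looked up in the model. A short sketch building on the variables above (model.clusterCenters holds the centers as NumPy arrays):

 from operator import itemgetter
 max_idx, max_size = max(cluster_sizes, key=itemgetter(1))
 model.clusterCenters[max_idx]  # the center of the biggest cluster, as a NumPy array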
