Extract k-means cluster information using Apache Spark


Question

I've implemented the Apache Spark example at

https://spark.apache.org/docs/1.1.0/mllib-clustering.html#

Here is the source:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))

// Cluster the data into two classes using KMeans
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)

The dataset used:

0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2

I can extract the cluster centers using:

println(clusters.clusterCenters.apply(0))
println(clusters.clusterCenters.apply(1))

which returns:

[9.1,9.1,9.1]
[0.10000000000000002,0.10000000000000002,0.10000000000000002]

But there are some things I'm not sure about, which do not seem to be supported by the API:

How can I extract which points have been assigned to each of the two clusters?

How can I add a label to each data point so that, when viewing the points in each cluster, I can also determine each point's label? Do I need to modify the Spark KMeans implementation to achieve this?

Recommended answer

If you are using Java:

JavaRDD<Integer> cluster_indices = clusters.predict(parsedData);

This works because predict is overloaded: besides a single point, it also accepts an RDD of points and returns the cluster index of each one.
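The same predict method can address both questions from Scala without modifying the KMeans implementation. The following is a minimal sketch, assuming a spark-shell session where `sc` is already defined and the data file from the question is available. The names `assigned`, `labeled`, and `labeledAssignments` are illustrative, and because the dataset has no labels of its own, the line number of each point stands in as a hypothetical label.

```scala
import org.apache.spark.SparkContext._ // pair-RDD operations such as groupByKey (needed on Spark versions before 1.3)
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Same setup as in the question; `sc` comes from spark-shell.
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
val clusters = KMeans.train(parsedData, 2, 20)

// Question 1: pair every point with the index of its nearest cluster center,
// then group by that index to see the members of each cluster.
val assigned = parsedData.map(p => (clusters.predict(p), p))
assigned.groupByKey().collect().foreach { case (clusterIndex, points) =>
  println("cluster " + clusterIndex + ": " + points.mkString(", "))
}

// Question 2: carry a label alongside each vector and predict on the vector only.
// Hypothetical labels: the line number of each point in the input file.
val labeled = data.zipWithIndex().map { case (line, idx) =>
  (idx, Vectors.dense(line.split(' ').map(_.toDouble)))
}
val labeledAssignments = labeled.map { case (label, v) =>
  (label, v, clusters.predict(v))
}
labeledAssignments.collect().foreach(println)
```

Which cluster receives index 0 or 1 depends on the random initialization, so the indices may differ between runs. Collecting to the driver is only reasonable for a small dataset like this one; for larger data, the assignments could be saved with saveAsTextFile instead.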
