Hierarchical Agglomerative clustering in Spark


Problem description

I am working on a clustering problem, and it has to be scalable to a large amount of data. I would like to try hierarchical clustering in Spark and compare my results with other methods.

I have done some research on the web about using hierarchical clustering with Spark, but haven't found any promising information.

If anyone has some insight about it, I would be very grateful. Thank you.

Solution

The Bisecting K-Means Approach

Bisecting k-means seems to do a decent job and runs quite fast. Here is sample code I wrote that uses the bisecting k-means algorithm in Spark (Scala) to get cluster centers from the Iris data set, which many people are familiar with. Note: I use Spark-Notebook for most of my Spark work; it is very similar to Jupyter Notebooks. I bring this up because you will need to create a Spark SQLContext for this example to work, and how you do that may differ based on where or how you are accessing Spark.

You can download the Iris.csv to test with here.

You can download Spark-Notebook here.

It is a great tool that makes it easy to run a standalone Spark cluster. If you want help with it on Linux or Mac, I can provide instructions. Once you download it, you need to compile it with SBT: from the base directory, use the following commands, sbt, then run.

It will be accessible at localhost:9000.

Required imports

 import org.apache.spark.sql.types._
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.BisectingKMeans
 

How to create the sqlContext in Spark-Notebook

 import org.apache.spark.sql.SQLContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 

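Side note: on newer Spark versions (2.0+) the usual entry point is a SparkSession rather than a SQLContext. A minimal sketch, assuming you are creating the session yourself rather than in Spark-Notebook (the app name below is just a placeholder):

import org.apache.spark.sql.SparkSession

// Build (or reuse) a SparkSession; spark.read can then be used in place of sqlContext.read
val spark = SparkSession.builder()
  .appName("bisecting-kmeans-iris")  // placeholder name
  .getOrCreate()
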
Defining the import schema

val customSchema = StructType(Array(
  StructField("c0", IntegerType, true),
  StructField("Sepal_Length", DoubleType, true),
  StructField("Sepal_Width", DoubleType, true),
  StructField("Petal_Length", DoubleType, true),
  StructField("Petal_Width", DoubleType, true),
  StructField("Species", StringType, true)))
 

Making the DataFrame

val iris_df = sqlContext.read
  .format("csv")
  .option("header", "true") // read the header row
  .option("mode", "DROPMALFORMED")
  .schema(customSchema)
  .load("/your/path/to/iris.csv")
 
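To sanity-check the load before assembling features, you can print the schema and a few rows; these are standard DataFrame calls, not part of the original answer:

// Quick check that the schema and data were read as expected
iris_df.printSchema()
iris_df.show(5)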

Specifying the features

val assembler = new VectorAssembler()
  .setInputCols(Array("c0", "Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"))
  .setOutputCol("features")
val iris_df_trans = assembler.transform(iris_df)
 
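Note that c0 looks like a row-index column from the CSV rather than a measurement; if that is the case, you may prefer to leave it out of the feature vector. A variant assembler (my suggestion, not part of the original answer):

// Same assembler, but without the presumed index column c0
val assemblerNoIndex = new VectorAssembler()
  .setInputCols(Array("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"))
  .setOutputCol("features")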

Model with 3 clusters (change with .setK)

 val bkm = new BisectingKMeans().setK(3).setSeed(1L).setFeaturesCol("features")
val model = bkm.fit(iris_df_trans)
 
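If you also want per-row cluster assignments, the fitted model can transform the DataFrame; "prediction" is Spark's default output column name for this model (this step is not in the original answer):

// Attach a "prediction" column holding the assigned cluster index for each row
val iris_predictions = model.transform(iris_df_trans)
iris_predictions.select("features", "prediction").show(5)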

Computing the cost

 val cost = model.computeCost(iris_df_trans)
 

Calculating the centers

 println(s"Within Set Sum of Squared Errors = $cost")
println("Cluster Centers: ")
val centers = model.clusterCenters
centers.foreach(println)
 
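Side note: newer Spark versions (3.x) deprecate computeCost on this model. If you are on Spark 2.3 or later, the silhouette score from ClusteringEvaluator is a common alternative; a small sketch:

import org.apache.spark.ml.evaluation.ClusteringEvaluator

// Silhouette over the predicted clusters (values closer to 1.0 indicate tighter, better-separated clusters)
val evaluator = new ClusteringEvaluator()
  .setFeaturesCol("features")
  .setPredictionCol("prediction")
val silhouette = evaluator.evaluate(model.transform(iris_df_trans))
println(s"Silhouette score = $silhouette")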

An Agglomerative Approach

The following provides an agglomerative hierarchical clustering implementation in Spark that is worth a look. It is not included in the base MLlib like the bisecting k-means method, and I do not have an example of it, but it is worth a look for those who are curious (a generic sketch of the agglomerative idea itself follows the links below).

GitHub project

YouTube video of the presentation at Spark Summit

Slides from Spark Summit

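Since there is no example against that project's own API here, the following is a purely illustrative, hypothetical driver-side sketch of the agglomerative idea itself: single-linkage merging on a small in-memory array of points (for example, cluster centers converted to arrays). It is not the linked project's code and is not distributed; the function names are mine.

// Hypothetical illustration only: single-linkage agglomerative clustering on the driver
def euclidean(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

def agglomerate(points: Array[Array[Double]], targetClusters: Int): List[List[Int]] = {
  // Start with every point in its own cluster; clusters hold point indices
  var clusters: List[List[Int]] = points.indices.map(i => List(i)).toList
  while (clusters.size > targetClusters) {
    // Find the pair of clusters with the smallest single-linkage (minimum pairwise) distance
    val pairs = for {
      i <- clusters.indices
      j <- clusters.indices if i < j
    } yield {
      val d = (for (a <- clusters(i); b <- clusters(j)) yield euclidean(points(a), points(b))).min
      (i, j, d)
    }
    val (i, j, _) = pairs.minBy(_._3)
    // Merge the closest pair and drop the two originals
    val merged = clusters(i) ++ clusters(j)
    clusters = merged :: clusters.zipWithIndex.collect { case (c, k) if k != i && k != j => c }
  }
  clusters
}

// Example: four 2-D points collapse into 2 clusters
val pts = Array(Array(0.0, 0.0), Array(0.1, 0.2), Array(5.0, 5.0), Array(5.2, 4.9))
println(agglomerate(pts, 2))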