Is it possible to access estimator attributes in spark.ml pipelines?

Question

I have a spark.ml pipeline in Spark 1.5.1 which consists of a series of transformers followed by a k-means estimator. I want to be able to access the KMeansModel.clusterCenters after fitting the pipeline, but can't figure out how. Is there a spark.ml equivalent of sklearn's pipeline.named_steps feature?

I found this answer which gives two options. The first works if I take the k-means model out of my pipeline and fit it separately, but that kinda defeats the purpose of a pipeline. The second option doesn't work - I get error: value getModel is not a member of org.apache.spark.ml.PipelineModel.

Example pipeline:

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
import org.apache.spark.ml.Pipeline

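// create example dataframe
// (each row is wrapped in Tuple1 because createDataFrame expects a Seq of Products, not bare Strings)
val sentenceData = sqlContext.createDataFrame(Seq(
  Tuple1("Hi I heard about Spark"),
  Tuple1("I wish Java could use case classes"),
  Tuple1("K-means models are neat")
  )).toDF("sentence")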

// initialize pipeline stages
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(20)
val kmeans = new KMeans()
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, kmeans))

// fit the pipeline
val fitKmeans = pipeline.fit(sentenceData)

So now fitKmeans is of type org.apache.spark.ml.PipelineModel. My question is, how do I access the cluster centers calculated by the k-means model contained within this pipeline? As noted above, when not contained in a pipeline, this can be done with fitKmeans.clusterCenters.

Recommended answer

Answering my own question...I finally stumbled on an example deep in the spark.ml docs that shows how to do this using the stages member of the PipelineModel class. So for the example I posted above, in order to access the k-means cluster centers, do:

val centers = fitKmeans.stages(2).asInstanceOf[KMeansModel].clusterCenters

where fitKmeans is a PipelineModel and 2 is the index of the k-means model in the array of pipeline stages.
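
If you'd rather not hard-code the stage index, one option is to pattern-match on the stage type instead. This is only a minimal sketch: findKMeans is a hypothetical helper name of my own, not part of Spark, and it assumes the pipeline contains at most one k-means stage (it returns the first one it finds).

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.clustering.KMeansModel

// Hypothetical helper (not a Spark API): scan the fitted stages and
// return the first one that is a KMeansModel, or None if there is none.
def findKMeans(model: PipelineModel): Option[KMeansModel] =
  model.stages.collectFirst { case m: KMeansModel => m }

// Usage with the fitted pipeline from the question:
// print each cluster center if a k-means stage was found.
findKMeans(fitKmeans).foreach { km =>
  km.clusterCenters.foreach(println)
}
```

This avoids the asInstanceOf cast and keeps working if stages are added or reordered ahead of the k-means step, at the cost of being ambiguous when a pipeline has more than one stage of the same type.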

Reference: the last line of most of the examples on this page.