How to do binary classification in Spark ML without StringIndexer

Problem description

I am trying to use the Spark ML DecisionTreeClassifier in a Pipeline without StringIndexer, because my label column is already indexed as (0.0; 1.0). DecisionTreeClassifier requires double values for the label, so this code should work:

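import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

def trainDecisionTreeModel(training: RDD[LabeledPoint], sqlc: SQLContext): Unit = {
  import sqlc.implicits._
  val trainingDF = training.toDF()
  // Schema of this DataFrame: [label: double, features: vector]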

  val featureIndexer = new VectorIndexer()
    .setInputCol("features")
    .setOutputCol("indexedFeatures")
    .setMaxCategories(4)
    .fit(trainingDF)

  val dt = new DecisionTreeClassifier()
    .setLabelCol("label")
    .setFeaturesCol("indexedFeatures")


  val pipeline = new Pipeline()
    .setStages(Array(featureIndexer, dt))
  pipeline.fit(trainingDF)
}

But in practice I get:

java.lang.IllegalArgumentException:
DecisionTreeClassifier was given input with invalid label column label,
without the number of classes specified. See StringIndexer.

Of course, I could just add StringIndexer and let it do its work on my double "label" field, but I want to use the rawPrediction output column of DecisionTreeClassifier to get the probability of 0.0 and 1.0 for each row, like this:

val predictions = model.transform(singletonDF)
// Take the single row and read its rawPrediction vector
val rawPrediction = predictions.select("rawPrediction").head().getAs[Vector](0)
val zeroProbability = rawPrediction(0)
val oneProbability = rawPrediction(1)

If I put StringIndexer in the Pipeline, I will not know the indexes of my input labels "0.0" and "1.0" in the rawPrediction vector, because StringIndexer assigns indexes by value frequency, which can vary between datasets.
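To make the concern concrete, here is a minimal plain-Scala sketch of frequency-based indexing. It is not Spark code: `frequencyIndex` and the two sample sequences are hypothetical, and the alphabetical tie-break is an assumption of this sketch. It only illustrates how the same label can land on a different index when the class balance changes.

```scala
// Simplified stand-in for frequency-based label indexing (not Spark code):
// the most frequent value gets index 0.0, the next one 1.0, and so on.
// Ties are broken alphabetically here, which is an assumption of this sketch.
def frequencyIndex(labels: Seq[String]): Map[String, Double] =
  labels
    .groupBy(identity)
    .toSeq
    .sortBy { case (value, occurrences) => (-occurrences.size, value) }
    .zipWithIndex
    .map { case ((value, _), index) => value -> index.toDouble }
    .toMap

// Two hypothetical training sets with opposite class balance.
val mostlyZeros = Seq("0.0", "0.0", "0.0", "1.0")
val mostlyOnes  = Seq("1.0", "1.0", "1.0", "0.0")

val indexA = frequencyIndex(mostlyZeros) // "0.0" maps to 0.0, "1.0" maps to 1.0
val indexB = frequencyIndex(mostlyOnes)  // "1.0" maps to 0.0, "0.0" maps to 1.0
```

With the second dataset the original label "1.0" ends up at position 0, so reading position 1 of a prediction vector would give the score for "0.0" instead.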

Please help me prepare the data for DecisionTreeClassifier without using StringIndexer, or suggest another way to get the probability of my original labels (0.0; 1.0) for each row.

Recommended answer

You can always set the required metadata manually:

import sqlContext.implicits._
import org.apache.spark.ml.attribute.NominalAttribute

val meta = NominalAttribute
  .defaultAttr
  .withName("label")
  .withValues("0.0", "1.0")
  .toMetadata

val dfWithMeta = df.withColumn("label", $"label".as("label", meta))
pipeline.fit(dfWithMeta)
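As a possible way to package and sanity-check this, the sketch below wraps the same metadata step in a helper and reads the attribute back from the schema before fitting. The helper names `withBinaryLabelMetadata` and `declaredLabelValues` are hypothetical, not part of Spark, and the sketch assumes the label column already holds the doubles 0.0 and 1.0.

```scala
import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: attaches nominal metadata for the two classes to an
// existing double-valued label column. The order of the values defines the
// index, so "0.0" is class 0 and "1.0" is class 1.
def withBinaryLabelMetadata(df: DataFrame, labelCol: String = "label"): DataFrame = {
  val meta = NominalAttribute.defaultAttr
    .withName(labelCol)
    .withValues("0.0", "1.0")
    .toMetadata
  df.withColumn(labelCol, col(labelCol).as(labelCol, meta))
}

// Hypothetical check: reads the attribute back from the schema and returns the
// declared class values in index order, or None if no nominal values are set.
def declaredLabelValues(df: DataFrame, labelCol: String = "label"): Option[Seq[String]] =
  Attribute.fromStructField(df.schema(labelCol)) match {
    case nominal: NominalAttribute => nominal.values.map(_.toSeq)
    case _ => None
  }
```

Calling `declaredLabelValues(withBinaryLabelMetadata(df))` before `pipeline.fit` should return the two values in the declared order, which is a quick way to confirm that the classifier will see the number of classes and that index 0 corresponds to the original label 0.0.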
