How to do binary classification in Spark ML without StringIndexer

Question

I am trying to use the Spark ML DecisionTreeClassifier in a Pipeline without StringIndexer, because my label is already indexed as (0.0; 1.0). DecisionTreeClassifier requires double values as labels, so this code should work:

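import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

def trainDecisionTreeModel(training: RDD[LabeledPoint], sqlc: SQLContext): Unit = {
  import sqlc.implicits._
  val trainingDF = training.toDF()
  // Format of this DataFrame: [label: double, features: vector]

  // Index categorical features (those with at most 4 distinct values).
  val featureIndexer = new VectorIndexer()
    .setInputCol("features")
    .setOutputCol("indexedFeatures")
    .setMaxCategories(4)
    .fit(trainingDF)

  // The label column is used as-is, without a StringIndexer stage.
  val dt = new DecisionTreeClassifier()
    .setLabelCol("label")
    .setFeaturesCol("indexedFeatures")

  val pipeline = new Pipeline()
    .setStages(Array(featureIndexer, dt))
  pipeline.fit(trainingDF)
}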

But what I actually get is:

java.lang.IllegalArgumentException:
DecisionTreeClassifier was given input with invalid label column label,
without the number of classes specified. See StringIndexer.

Of course, I could just add a StringIndexer and let it do its work on my double "label" field, but I want to use the rawPrediction output column of DecisionTreeClassifier to get the probabilities of 0.0 and 1.0 for each row, like this:

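val predictions = model.transform(singletonDF)
// Vector here is Spark's linalg Vector, not scala.collection.immutable.Vector.
// Take the first row's rawPrediction vector, then read its entries by class index.
val rawPrediction = predictions.select("rawPrediction").head.getAs[Vector](0)
val zeroProbability = rawPrediction(0)
val oneProbability = rawPrediction(1)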

If I put a StringIndexer in the Pipeline, I will not know which positions in the rawPrediction vector correspond to my input labels "0.0" and "1.0", because StringIndexer assigns indices by value frequency, which can vary from one dataset to another.
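To make the concern concrete, here is a minimal plain-Scala sketch of frequency-based indexing. It is not Spark's StringIndexer code: `FrequencyIndexingSketch` and `indexByFrequency` are hypothetical names, and the tie-breaking rule is an assumption added only to keep the sketch deterministic. It shows that the same label can land on a different index when the class balance changes.

```scala
// Plain-Scala illustration of frequency-based label indexing.
// This is NOT Spark's StringIndexer implementation, only a sketch of the idea.
object FrequencyIndexingSketch {

  /** Assigns index 0 to the most frequent label, 1 to the next, and so on. */
  def indexByFrequency(labels: Seq[String]): Map[String, Int] =
    labels
      .groupBy(identity)
      .toSeq
      .map { case (label, occurrences) => (label, occurrences.size) }
      // Sort by descending count; ties are broken by label text
      // (an assumption of this sketch, to keep the result deterministic).
      .sortBy { case (label, count) => (-count, label) }
      .map { case (label, _) => label }
      .zipWithIndex
      .toMap
}
```

With the labels `Seq("0.0", "0.0", "1.0")` the label "0.0" gets index 0, while with `Seq("1.0", "1.0", "0.0")` it gets index 1, which is exactly why the position of a given label in the output vector cannot be assumed in advance.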

Please help me prepare the data for DecisionTreeClassifier without using StringIndexer, or suggest another way to get the probability of my original labels (0.0; 1.0) for each row.

Solution

You can always set the required metadata manually:

import sqlContext.implicits._
import org.apache.spark.ml.attribute.NominalAttribute

val meta = NominalAttribute
  .defaultAttr
  .withName("label")
  .withValues("0.0", "1.0")
  .toMetadata

val dfWithMeta = df.withColumn("label", $"label".as("label", meta))
pipeline.fit(dfWithMeta)
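Once the label metadata is fixed this way, class index 0 should correspond to label 0.0 and index 1 to label 1.0, since the double label is used directly as the class index. The sketch below, in plain Scala with no Spark dependency, pairs the metadata values with a raw prediction and normalizes it into probabilities. It assumes that the raw prediction of a decision tree holds non-negative per-class scores (such as class counts at the leaf); that is an assumption not stated in the answer, and `RawPredictionSketch` and `labelProbabilities` are hypothetical names.

```scala
// Plain-Scala sketch: map a raw prediction to per-label probabilities.
// Assumes raw scores are non-negative per-class values, ordered by class index.
object RawPredictionSketch {

  /** Pairs each metadata value with its normalized share of the raw prediction. */
  def labelProbabilities(values: Seq[String], raw: Seq[Double]): Map[String, Double] = {
    require(values.length == raw.length, "one raw score per class is expected")
    val total = raw.sum
    require(total > 0.0, "raw scores must not all be zero")
    // Position i in `raw` belongs to the label named values(i).
    values.zip(raw).map { case (label, score) => label -> score / total }.toMap
  }
}
```

For example, `RawPredictionSketch.labelProbabilities(Seq("0.0", "1.0"), Seq(3.0, 1.0))` yields 0.75 for "0.0" and 0.25 for "1.0". In a real job the raw values would come from the rawPrediction column of the transformed DataFrame.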
