Probability of predictions using Spark LogisticRegressionWithLBFGS for multiclass classification


Problem description

I am using LogisticRegressionWithLBFGS() to train a model with multiple classes.

The MLlib documentation states that clearThreshold() can be used only if the classification is binary. Is there a similar way, for multiclass classification, to output the probability of each class for a given input to the model?
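For context, a minimal standalone sketch of what clearThreshold() changes in the binary case: with a threshold set, predict returns a 0/1 label; with it cleared, the raw probability. This is not the Spark API itself; the object and parameter names here are illustrative only.

```scala
object ThresholdSketch {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // With Some(t): thresholded 0/1 label. With None: raw probability,
  // which is what clearThreshold() enables on a binary model.
  def predict(margin: Double, threshold: Option[Double]): Double =
    threshold match {
      case Some(t) => if (sigmoid(margin) > t) 1.0 else 0.0
      case None    => sigmoid(margin)
    }
}
```

The question is how to get the equivalent of the `None` branch, per class, from a multiclass model.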

Recommended answer

There are two ways to accomplish this. One is to create a method that takes over the responsibility of predictPoint in LogisticRegression.scala (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala):

object ClassificationUtility {
  def predictPoint(dataMatrix: Vector, model: LogisticRegressionModel):
    (Double, Array[Double]) = {
    require(dataMatrix.size == model.numFeatures)
    val dataWithBiasSize: Int = model.weights.size / (model.numClasses - 1)
    val weightsArray: Array[Double] = model.weights match {
      case dv: DenseVector => dv.values
      case _ =>
        throw new IllegalArgumentException(s"weights only supports dense vector but got type ${model.weights.getClass}.")
    }
    var bestClass = 0
    var maxMargin = 0.0
    val withBias = dataMatrix.size + 1 == dataWithBiasSize
    val classProbabilities: Array[Double] = new Array[Double](model.numClasses)
    (0 until model.numClasses - 1).foreach { i =>
      var margin = 0.0
      dataMatrix.foreachActive { (index, value) =>
        if (value != 0.0) margin += value * weightsArray((i * dataWithBiasSize) + index)
      }
      // Intercept is required to be added into margin.
      if (withBias) {
        margin += weightsArray((i * dataWithBiasSize) + dataMatrix.size)
      }
      if (margin > maxMargin) {
        maxMargin = margin
        bestClass = i + 1
      }
      classProbabilities(i+1) = 1.0 / (1.0 + Math.exp(-margin))
    }
    (bestClass.toDouble, classProbabilities)
  }
}
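To make the margin-to-probability step above concrete, here is a minimal standalone sketch of the same computation without Spark types. The weight and feature values are made-up illustrative numbers; each weight row holds one non-reference class, with the intercept stored last, mirroring the layout predictPoint assumes.

```scala
object MarginSketch {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // weights: one row per non-reference class; the intercept is the last entry.
  // Returns (best class index, per-class sigmoid scores; index 0 is the
  // reference class and stays 0.0, as in the method above).
  def predictPoint(features: Array[Double],
                   weights: Array[Array[Double]],
                   numClasses: Int): (Int, Array[Double]) = {
    val probs = new Array[Double](numClasses)
    var best = 0
    var maxMargin = 0.0
    for (i <- 0 until numClasses - 1) {
      val margin = features.zip(weights(i).init)
        .map { case (x, w) => x * w }.sum + weights(i).last
      if (margin > maxMargin) { maxMargin = margin; best = i + 1 }
      probs(i + 1) = sigmoid(margin)
    }
    (best, probs)
  }
}
```

For example, with features (1.0, 2.0) and weight rows (0.5, -0.25, 0.1) and (-0.4, 0.3, -0.2), the margins are 0.1 and 0.0, so class 1 wins.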

Note that it is only slightly different from the original method; it just calculates the logistic as a function of the input features. It also defines some vals and vars that are private in the original and includes them inside this method. Ultimately, it indexes the scores in an Array and returns that along with the best answer. I call my method like so:

// Compute raw scores on the test set.
val predictionAndLabelsAndProbabilities = test
  .map { case LabeledPoint(label, features) =>
    val (prediction, probabilities) = ClassificationUtility
      .predictPoint(features, model)
    (prediction, label, probabilities)
  }

However:

It seems the Spark contributors are discouraging the use of MLlib in favor of ML. The ML logistic regression API currently does not support multiclass classification. I am now using OneVsRest, which acts as a wrapper for one-vs-all classification. You can obtain the raw scores by iterating through the models:

val lr = new LogisticRegression().setFitIntercept(true)
val ovr = new OneVsRest()
ovr.setClassifier(lr)
val ovrModel = ovr.fit(training)
ovrModel.models.zipWithIndex.foreach {
  case (model: LogisticRegressionModel, i: Int) =>
    model.save(s"model-${model.uid}-$i")
}

val model0 = LogisticRegressionModel.load("model-logreg_457c82141c06-0")
val model1 = LogisticRegressionModel.load("model-logreg_457c82141c06-1")
val model2 = LogisticRegressionModel.load("model-logreg_457c82141c06-2")

Now that you have the individual models, you can obtain the probabilities by calculating the sigmoid of the rawPrediction:

def sigmoid(x: Double): Double = {
  1.0 / (1.0 + Math.exp(-x))
}

val newPredictionAndLabels0 = model0.transform(newRescaledData)
  .select("prediction", "rawPrediction")
  .map(row => (row.getDouble(0),
    sigmoid(row.getAs[org.apache.spark.mllib.linalg.DenseVector](1).values(1)) ))
newPredictionAndLabels0.foreach(println)

val newPredictionAndLabels1 = model1.transform(newRescaledData)
  .select("prediction", "rawPrediction")
  .map(row => (row.getDouble(0),
    sigmoid(row.getAs[org.apache.spark.mllib.linalg.DenseVector](1).values(1)) ))
newPredictionAndLabels1.foreach(println)

val newPredictionAndLabels2 = model2.transform(newRescaledData)
  .select("prediction", "rawPrediction")
  .map(row => (row.getDouble(0),
    sigmoid(row.getAs[org.apache.spark.mllib.linalg.DenseVector](1).values(1)) ))
newPredictionAndLabels2.foreach(println)
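One caveat with the one-vs-rest approach above: the three per-class sigmoid scores come from independent binary models and do not sum to 1. If calibrated per-class probabilities are wanted, a common follow-up is to normalize the scores. A minimal sketch, with made-up values standing in for the sigmoid(rawPrediction) outputs of model0 through model2:

```scala
object NormalizeScores {
  // Divide each one-vs-rest sigmoid score by their sum so the
  // results form a probability distribution over the classes.
  def normalize(scores: Array[Double]): Array[Double] = {
    val total = scores.sum
    scores.map(_ / total)
  }
}
```

Normalization preserves the ranking of the classes, so the argmax (and thus the predicted label) is unchanged.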

