使用树输出预测Spark中的梯度增强树时的类概率 [英] Predicting probabilities of classes in case of Gradient Boosting Trees in Spark using the tree output

查看:105
本文介绍了使用树输出预测Spark中的梯度增强树时的类概率的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

众所周知,到目前为止,Spark中的GBT可以为您提供预测的标签.

It is known that GBT s in Spark gives you predicted labels as of now.

我当时正在考虑尝试计算一个类的预测概率(例如,所有实例都落在某个叶子下)

I was thinking of trying to calculate predicted probabilities for a class (say all the instances falling under a certain leaf)

构建GBT的代码

import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.mllib.util.MLUtils

//Importing the data
val data = sc.textFile("data/mllib/credit_approval_2_attr.csv") //using the credit approval data set from UCI machine learning repository

//Parsing the data
val parsedData = data.map { line =>
    val parts = line.split(',').map(_.toDouble)
    LabeledPoint(parts(0), Vectors.dense(parts.tail))
}

//Splitting the data
val splits = parsedData.randomSplit(Array(0.7, 0.3), seed = 11L)
val training = splits(0).cache() 
val test = splits(1)

// Train a GradientBoostedTrees model.
// The defaultParams for Classification use LogLoss by default.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 2 // We can use more iterations in practice.
boostingStrategy.treeStrategy.numClasses = 2
boostingStrategy.treeStrategy.maxDepth = 2
boostingStrategy.treeStrategy.maxBins = 32
boostingStrategy.treeStrategy.subsamplingRate = 0.5
boostingStrategy.treeStrategy.maxMemoryInMB =1024
boostingStrategy.learningRate = 0.1

// Empty categoricalFeaturesInfo indicates all features are continuous.
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()

val model = GradientBoostedTrees.train(training, boostingStrategy)  

model.toDebugString

为简单起见,这为我提供了2个深度为2的树:

This gives me 2 trees of depth 2 as below for simplicity:

 Tree 0:
    If (feature 3 <= 2.0)
     If (feature 2 <= 1.25)
      Predict: -0.5752212389380531
     Else (feature 2 > 1.25)
      Predict: 0.07462686567164178
    Else (feature 3 > 2.0)
     If (feature 0 <= 30.17)
      Predict: 0.7272727272727273
     Else (feature 0 > 30.17)
      Predict: 1.0
  Tree 1:
    If (feature 5 <= 67.0)
     If (feature 4 <= 100.0)
      Predict: 0.5739387416147804
     Else (feature 4 > 100.0)
      Predict: -0.550117566730937
    Else (feature 5 > 67.0)
     If (feature 2 <= 0.0)
      Predict: 3.0383669122382835
     Else (feature 2 > 0.0)
      Predict: 0.4332824083446489

我的问题是:我可以使用上述树来计算预测概率吗?

My question is: Can I use the above trees to calculate predicted probabilities like:

关于用于预测的功能集中的每个实例

With respect to every instance in the feature set used for prediction

exp(树0的叶子分数+树1的叶子分数)/(1 + exp(树0的叶子分数+树木1的叶子分数))

exp(leaf score from tree 0 + leaf score from tree 1)/(1+exp(leaf score from tree 0 + leaf score from tree 1))

这给了我一种可能性.但是不确定这是否是正确的方法.另外,是否有任何文档说明如何计算叶子分数(预测).如果有人可以分享,我将不胜感激.

This gives me a kind of probability. But not sure if it is the right way to do it. Also if there is any document explaining how leaf score (prediction) are calculated. I would be really grateful if anybody can share.

任何建议都是极好的.

Any suggestion would be superb.

推荐答案

这是我使用Spark内部依赖项的方法.您稍后将需要导入线性代数库以进行矩阵运算,即将树预测与学习率相乘.

Here is my approach using Spark internal dependencies. You will need to import the linear algebra library for the matrix operation later, i.e., multiplying the tree predictions with the learning rate.

import org.apache.spark.mllib.linalg.{Vectors, Matrices}
import org.apache.spark.mllib.linalg.distributed.{RowMatrix}

假设您使用GBT建立模型:

Say you build a model with GBT:

val model = GradientBoostedTrees.train(trainingData, boostingStrategy)

要使用模型对象计算概率,请执行以下操作:

To calculate the probability using the model object:

// Get the log odds predictions from each tree
val treePredictions = testData.map { point => model.trees.map(_.predict(point.features)) }

// Transform the arrays into matrices for multiplication
val treePredictionsVector = treePredictions.map(array => Vectors.dense(array))
val treePredictionsMatrix = new RowMatrix(treePredictionsVector)
val learningRate = model.treeWeights
val learningRateMatrix = Matrices.dense(learningRate.size, 1, learningRate)
val weightedTreePredictions = treePredictionsMatrix.multiply(learningRateMatrix)

// Calculate probability by ensembling the log odds
val classProb = weightedTreePredictions.rows.flatMap(_.toArray).map(x => 1 / (1 + Math.exp(-1 * x)))
classProb.collect

// You may tweak your decision boundary for different class labels
val classLabel = classProb.map(x => if (x > 0.5) 1.0 else 0.0)
classLabel.collect


这是一个代码段,您可以复制&直接粘贴到spark-shell:


Here is a code snippet you can copy & paste directly into spark-shell:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.{Vectors, Matrices}
import org.apache.spark.mllib.linalg.distributed.{RowMatrix}
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

// Load and parse the data file.
val csvData = sc.textFile("data/mllib/sample_tree_data.csv")
val data = csvData.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts(0), Vectors.dense(parts.tail))
}
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a GBT model.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 50
boostingStrategy.treeStrategy.numClasses = 2
boostingStrategy.treeStrategy.maxDepth = 6
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()

val model = GradientBoostedTrees.train(trainingData, boostingStrategy)

// Get class label from raw predict function
val predictedLabels = model.predict(testData.map(_.features))
predictedLabels.collect

// Get class probability
val treePredictions = testData.map { point => model.trees.map(_.predict(point.features)) }
val treePredictionsVector = treePredictions.map(array => Vectors.dense(array))
val treePredictionsMatrix = new RowMatrix(treePredictionsVector)
val learningRate = model.treeWeights
val learningRateMatrix = Matrices.dense(learningRate.size, 1, learningRate)
val weightedTreePredictions = treePredictionsMatrix.multiply(learningRateMatrix)
val classProb = weightedTreePredictions.rows.flatMap(_.toArray).map(x => 1 / (1 + Math.exp(-1 * x)))
val classLabel = classProb.map(x => if (x > 0.5) 1.0 else 0.0)
classLabel.collect

这篇关于使用树输出预测Spark中的梯度增强树时的类概率的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆